Abstract. We present two data-driven procedures to estimate the transition density of a
homogeneous Markov chain. The first yields a piecewise constant estimator on a suitable
random partition. Using a Hellinger-type loss, we establish non-asymptotic risk bounds for
our estimator when the square root of the transition density belongs to possibly inhomogeneous
Besov spaces with possibly small regularity index. Some simulations are also provided. The
second procedure is of theoretical interest and leads to a general model selection theorem from
which we derive rates of convergence over a very wide range of possibly inhomogeneous and
anisotropic Besov spaces. We also investigate the rates that can be achieved under structural
assumptions on the transition density.



Consider a time-homogeneous Markov chain $(X_i)_{i \in \mathbb{N}}$ defined on an abstract probability space
$(\Omega, \mathcal{F}, \mathbb{P})$ with values in the measured space $(\mathbb{X}, \mathcal{X}, \mu)$. We assume that for each $x \in \mathbb{X}$, the
conditional law $\mathcal{L}(X_{i+1} \mid X_i = x)$ admits a density $s(x, \cdot)$ with respect to $\mu$. Our aim is to
estimate the transition density $(x, y) \mapsto s(x, y)$ on a subset $A = A_1 \times A_2$ of $\mathbb{X}^2$ from the
observations $X_0, \dots, X_n$.

Many papers are devoted to this statistical setting. A popular method to build an estimator
of $s$ is to divide an estimator of the joint density of $(X_i, X_{i+1})$ by an estimator of the density of
$X_i$. The resulting estimator is called a quotient estimator. Roussas (1969) and Athreya and
Atuncar (1998) considered kernel estimators for the densities of $X_i$ and $(X_i, X_{i+1})$. They proved
consistency and asymptotic normality of the quotient estimator. Other properties of this
estimator were established: Roussas (1991) and Dorea (2002) showed strong consistency, Basu and
Sahoo (1998) proved a Berry-Esseen type theorem and Doukhan and Ghindès (1983) bounded from
above the integrated quadratic risk under Sobolev constraints. Clémençon (2000) investigated
the minimax rates when $A = [0,1]^2$, $\mathbb{X}^2 = \mathbb{R}^2$. Given two smoothness classes $\mathcal{F}_1$ and $\mathcal{F}_2$ of real
valued functions on $[0,1]^2$ and $[0,1]$ respectively (balls of Besov spaces), he established the lower
bounds over the class



He developed a method based on wavelet thresholding to estimate the densities of $X_i$ and
$(X_i, X_{i+1})$ and showed that the quotient estimator of $s$ is quasi-optimal in the sense that the
minimax rates are achieved up to possible logarithmic factors. Lacour (2008) used model
selection via penalization to construct estimates of the densities. The resulting quotient estimator
reaches the minimax rates over $\mathcal{F}$ when $\mathcal{F}_1$ and $\mathcal{F}_2$ are balls of homogeneous (but possibly
anisotropic) Besov spaces on $[0,1]^2$ and $[0,1]$ respectively.

The previous rates of convergence depend on the smoothness properties of the densities of $X_i$
and $(X_i, X_{i+1})$. In the favourable case where $X_0, \dots, X_n$ are drawn from a stationary Markov
chain (with stationary density $f$), the rates depend on the smoothness properties of $f$ or, more
precisely, on the restriction of $f$ to $A_1$. This function may however be less regular than the
target function $s$. We refer for instance to Section 5.4.1 of Clémençon (2000) for an example
of a Doeblin recurrent Markov chain where the stationary density $f$ is discontinuous on $[0,1]$
although $s$ is constant on $[0,1]^2$. Therefore, these estimators may converge slowly even if $s$ is
smooth, which is problematic.

This issue was overcome in several papers. Clémençon (2000) proposed a second procedure,
based on wavelets and an analogy with the regression setting. He computed the lower bounds
of minimax rates when the restriction of $s$ to $[0,1]^2$ belongs to balls of some (possibly
inhomogeneous) Besov spaces and proved that his estimator achieves these rates up to a possible
logarithmic factor. Lacour (2007) established lower bounds over balls of some (homogeneous but
possibly anisotropic) Besov spaces. By minimizing a penalized contrast inspired by least
squares, she obtained a model selection theorem from which she deduced that her estimator
reaches the minimax rates when $A = [0,1]^2$, $\mathbb{X}^2 = \mathbb{R}^2$. With a similar procedure, Akakpo and
Lacour (2011) obtained the usual rates of convergence over balls of possibly anisotropic and
inhomogeneous Besov spaces (when $\mathbb{X}^2 = A = [0,1]^{2d}$). Very recently, Birgé (2012) proposed a
procedure based on robust testing to establish a general oracle inequality. The expected rates of
convergence can be deduced from this inequality when $\sqrt{s}$ belongs to balls of possibly anisotropic
and inhomogeneous Besov spaces.

These authors have used different losses in order to evaluate the performance of their
estimators. In each of these papers, the risk of an estimator $\hat{s}$ is of the form $\mathbb{E}[\delta^2(s\mathbb{1}_A, \hat{s})]$ where $\mathbb{1}_A$
denotes the indicator function of the subset $A$ and $\delta$ a suitable distance. Lacour (2007) and Akakpo
and Lacour (2011) considered the space $\mathbb{L}^2(\mathbb{X}^2, M)$ of square integrable functions on $\mathbb{X}^2$ equipped
with the random product measure $M = \lambda_n \otimes \mu$ where $\lambda_n = n^{-1} \sum_{i=0}^{n-1} \delta_{X_i}$, and used the distance
defined for $f, f' \in \mathbb{L}^2(\mathbb{X}^2, M)$ by

$$\delta^2(f, f') = \frac{1}{n} \sum_{i=0}^{n-1} \int_{\mathbb{X}} \left( f(X_i, y) - f'(X_i, y) \right)^2 d\mu(y).$$

Birgé (2012) considered the cone $\mathbb{L}^1_+(\mathbb{X}^2, \mu \otimes \mu)$ of non-negative integrable functions and used
the deterministic Hellinger-type distance defined for $f, f' \in \mathbb{L}^1_+(\mathbb{X}^2, \mu \otimes \mu)$ by

$$\delta^2(f, f') = \frac{1}{2} \int_{\mathbb{X}^2} \left( \sqrt{f(x, y)} - \sqrt{f'(x, y)} \right)^2 d\mu(x)\, d\mu(y).$$

These approaches, which often rely on the loss that is used, require the knowledge (or at least a
suitable estimation) of various quantities depending on the unknown $s$, such as the supremum
norm of $s$, or a positive lower bound, either on the stationary density, or on $k^{-1} \sum_{j=1}^{k} s^{(l+j)}$
for some $k \geq 1$, $l \geq 0$, where $s^{(l+j)}(x, \cdot)$ is the density of the conditional law $\mathcal{L}(X_{l+j} \mid X_0 = x)$.
Unfortunately, these quantities influence not only the way the estimators are built but also their
performance, since they are involved in the risk bounds. In the present paper, we shall rather
consider the distance $H$ (corresponding to an analogue of the random loss above) defined on
the cone $\mathbb{L}^1_+(\mathbb{X}^2, M)$ of integrable and non-negative functions by

$$H^2(f, f') = \frac{1}{2n} \sum_{i=0}^{n-1} \int_{\mathbb{X}} \left( \sqrt{f(X_i, y)} - \sqrt{f'(X_i, y)} \right)^2 d\mu(y) \quad \text{for all } f, f' \in \mathbb{L}^1_+(\mathbb{X}^2, M).$$

For such a loss, we shall show that our estimators satisfy an oracle-type inequality under very
weak assumptions on the Markov chain. A connection with the usual deterministic Hellinger-type
loss will be made under a posteriori assumptions on the chain, and hence independently of
the construction of the estimator.
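
To make the random loss concrete, here is a minimal numerical sketch. It assumes $d = 1$, takes $\mu$ to be the Lebesgue measure, supposes that both functions vanish outside $[0,1]$ in the second variable, and reads the loss as $H^2(f,f') = \frac{1}{2n}\sum_{i=0}^{n-1}\int (\sqrt{f(X_i,y)} - \sqrt{f'(X_i,y)})^2\,dy$. The function name and the midpoint quadrature are illustrative choices, not part of the paper.

```python
import math

def empirical_hellinger_sq(f, g, xs, n_grid=1000):
    """Approximate H^2(f, g) = 1/(2n) * sum_i int_0^1 (sqrt f(X_i, y) - sqrt g(X_i, y))^2 dy.

    f, g   : non-negative functions of (x, y), assumed to vanish for y outside [0, 1]
    xs     : the n conditioning points X_0, ..., X_{n-1}
    n_grid : number of midpoints used for the integral over y (illustrative choice)
    """
    n = len(xs)
    total = 0.0
    for x in xs:
        inner = 0.0
        for k in range(n_grid):
            y = (k + 0.5) / n_grid                      # midpoint of the k-th cell of [0, 1]
            diff = math.sqrt(f(x, y)) - math.sqrt(g(x, y))
            inner += diff * diff
        total += inner / n_grid                         # midpoint rule for the inner integral
    return total / (2.0 * n)
```

The loss only looks at the functions along the sections $\{X_i\} \times [0,1]$ visited by the chain, which is why it can be controlled without a lower bound on the stationary density.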

Our estimation strategy can be viewed as a mix between an approach based on the
minimization of a contrast and an approach based on robust tests. Estimation procedures based on
tests started in the seventies with Lucien Le Cam and Lucien Birgé (Le Cam (1973, 1975); Birgé
(1983); Birgé (1984a, b)). More recently, Birgé (2006) presented a powerful device to establish
general oracle inequalities from robust tests. It was used in our statistical setting in Birgé (2012)
and in many others in Birgé (2007, 2008) and Sart (2012). We make two contributions to this
area. Firstly, we provide a new test for our statistical setting. This test is based on a variational
formula inspired by Baraud (2010) and differs from the one of Birgé (2012). Secondly, we
shall study procedures that are quite far from the original one of Birgé (2006). Let us explain
why.

The procedure of Birgé (2006) depends on a suitable net, the construction of which is usually
abstract, which makes the estimator impossible to build in practice. In the favourable cases
where the net can be made explicit, the procedure is in any case too complex to be implemented
(see for instance Section 3.4.2 of Birgé (2007)). This procedure was afterwards adapted to
estimator selection in Baraud and Birgé (2009) (for histogram type estimators) and in Baraud
(2010) (for more general estimators). The complexity of their algorithms is of the order of the
square of the cardinality of the family, so they are implementable when this family is not too large.
In particular, given a family of histogram type estimators $\{\hat{s}_m, m \in \mathcal{M}\}$, these two procedures
are interesting in practice when $\mathcal{M}$ is a collection of regular partitions (namely when all its
elements have the same Lebesgue measure) but unfortunately become numerically intractable for
richer collections. In this work, we tackle this issue by proposing a new way of selecting among
a family of piecewise constant estimators when the collection $\mathcal{M}$ ensues from the adaptive
approximation algorithm of DeVore and Yu (1990).

We present this procedure in the first part of the paper. It yields a piecewise constant
estimator on a data-driven partition that satisfies an oracle-type inequality from which we shall
deduce uniform rates of convergence over balls of (possibly) inhomogeneous Besov spaces with
small regularity indices. These rates coincide, up to a possible logarithmic factor, with the usual
ones over such classes. Finally, we carry out numerical simulations to compare our estimator
with the one of Akakpo and Lacour (2011).

In the second part of this paper, we are interested in obtaining stronger theoretical results for
our statistical problem. We put aside the practical considerations to focus on the construction
of an estimator that satisfies a general model selection theorem. Such an estimator should be
considered as a benchmark for what is theoretically feasible. We deduce rates of convergence over
a large range of anisotropic and inhomogeneous Besov spaces on $[0,1]^{2d}$. We shall also consider
other kinds of assumptions on the transition density. We shall assume that $s$ belongs to classes
of functions satisfying structural assumptions and for which faster rates of convergence can
be achieved. This approach was developed by Juditsky et al. (2009) (in the Gaussian white
noise model) and by Baraud and Birgé (2011) (in more statistical settings) to avoid the curse of
dimensionality. More precisely, Baraud and Birgé (2011) showed that these rates can be deduced
from a general model selection theorem, which strengthens its theoretical interest. This strategy
was used in Sart (2012) to establish risk bounds over many classes of functions for Poisson
processes with covariates. We shall use these assumptions to obtain faster rates of convergence
for autoregressive Markov chains (whose conditional variance may not be constant).

This paper is organized as follows. The first procedure, which selects among piecewise constant
estimators, is presented and theoretically studied in Section 2. In Section 3, we carry out a
simulation study and compare our estimator with the one of Akakpo and Lacour (2011). The
practical implementation of this procedure is quite technical and is therefore deferred to
the appendix, in Section 5. In Section 4, we establish theoretical results by using our second
procedure. The proofs are postponed to Section 6.

Let us introduce some notation that will be used throughout the paper. The number $x \vee y$
(respectively $x \wedge y$) stands for $\max(x, y)$ (respectively $\min(x, y)$) and $x_+$ stands for $x \vee 0$. We
set $\mathbb{N}^* = \mathbb{N} \setminus \{0\}$. For $(E, d)$ a metric space, $x \in E$ and $A \subset E$, the distance between $x$ and $A$ is
denoted by $d(x, A) = \inf_{a \in A} d(x, a)$. The indicator function of a subset $A$ is denoted by $\mathbb{1}_A$ and
the restriction of a function $f$ to $A$ by $f|_A$. For every real valued function $f$ on $E$, $\|f\|_\infty$ stands for
$\sup_{x \in E} |f(x)|$. The cardinality of a finite set $A$ is denoted by $|A|$. The notations $C, C', C'', \dots$ are
for the constants, which may change from line to line.



2. Selecting among piecewise constant estimators.

Throughout this section, we assume that $\mathbb{X} = \mathbb{R}^d$, $A = [0,1]^{2d}$, $\mu([0,1]^d) = 1$ and $n \geq 3$.

2.1. Preliminary estimators. Given a (finite) partition $m$ of $[0,1]^{2d}$, a simple way to
estimate $s$ on $[0,1]^{2d}$ is to consider the piecewise constant estimator on the elements of $m$ defined
by

(1) $$\hat{s}_m = \sum_{K \in m} \frac{\sum_{i=0}^{n-1} \mathbb{1}_K(X_i, X_{i+1})}{\sum_{i=0}^{n-1} \int_{\mathbb{X}} \mathbb{1}_K(X_i, x)\, d\mu(x)}\, \mathbb{1}_K.$$

In the above definition, the denominator $\sum_{i=0}^{n-1} \int_{\mathbb{X}} \mathbb{1}_K(X_i, x)\, d\mu(x)$ may be equal to 0 for some
sets $K$, in which case the numerator $\sum_{i=0}^{n-1} \mathbb{1}_K(X_i, X_{i+1}) = 0$ as well, and we shall use the
convention $0/0 = 0$.

We now bound from above the risk of this estimator. We set



and prove the following.
Proposition 1. For every finite partition $m$ of $[0,1]^{2d}$,

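$$C\, \mathbb{E}\left[ H^2(s\mathbb{1}_A, \hat{s}_m) \right] \leq \mathbb{E}\left[ H^2(s\mathbb{1}_A, V_m) \right] + \frac{1 + \log n}{n}\, |m|$$

where $C = 1/(4 + \log 2)$.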

Up to a constant, the risk of $\hat{s}_m$ is bounded by a sum of two terms. The first one corresponds
to the approximation term whereas the second one corresponds to the estimation term.

An analogous upper bound on the empirical quadratic risk of this estimator may be found
in Chapter 4 of Akakpo (2009). Her bound requires several assumptions on the partition $m$
and the Markov chain whereas the present one requires none. However, unlike hers, our bound
involves an extra logarithmic term.
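
The ratio described in Section 2.1 can be sketched in the simplest situation. The code below assumes $d = 1$, the Lebesgue measure for $\mu$, and a regular partition of $[0,1]^2$ into $J \times J$ squares (not the adaptive partitions studied later). For a cell $K = I_a \times I_b$, the numerator counts the transitions $(X_i, X_{i+1})$ falling in $K$ and the denominator is the number of $X_i$ in $I_a$ times the length $1/J$ of $I_b$. The function name is hypothetical.

```python
def histogram_transition_estimator(chain, J):
    """Return a J x J table est[a][b]: value of the piecewise constant estimator on the
    cell [a/J, (a+1)/J) x [b/J, (b+1)/J) of [0, 1]^2 (d = 1, Lebesgue measure)."""
    def cell(u):
        # Index of the interval containing u, or None if u lies outside [0, 1].
        if u < 0.0 or u > 1.0:
            return None
        return min(int(u * J), J - 1)            # the point u = 1 goes to the last interval

    pair_counts = [[0] * J for _ in range(J)]    # numerators: transitions falling in each cell
    row_counts = [0] * J                         # number of X_i in each first-coordinate interval
    for x, y in zip(chain[:-1], chain[1:]):
        a = cell(x)
        if a is None:
            continue
        row_counts[a] += 1
        b = cell(y)
        if b is not None:
            pair_counts[a][b] += 1

    est = [[0.0] * J for _ in range(J)]
    for a in range(J):
        denom = row_counts[a] / J                # sum over i of the length of the y-section of the cell
        for b in range(J):
            est[a][b] = pair_counts[a][b] / denom if denom > 0 else 0.0   # convention 0/0 = 0
    return est
```

For the chain `[0.1, 0.6, 0.2, 0.7, 0.3]` and `J = 2`, every transition goes from one half of $[0,1]$ to the other, so the estimator equals 2 on the two off-diagonal squares and 0 elsewhere.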

2.2. Definition of the partitions. In this section, we shall deal with a special choice of
partitions $m$. More precisely, we consider the family of partitions defined by using the recursive
algorithm developed in DeVore and Yu (1990). For $j \in \mathbb{N}$, we consider the set

$$\mathcal{L}_j = \left\{ l = (l_1, \dots, l_{2d}) \in \mathbb{N}^{2d},\ 1 \leq l_i \leq 2^j \text{ for } 1 \leq i \leq 2d \right\}$$

and define for all $l = (l_1, \dots, l_{2d}) \in \mathcal{L}_j$,

$$\forall i \in \{1, \dots, 2d\}, \quad I_j(l_i) = \begin{cases} \left[ \frac{l_i - 1}{2^j}, \frac{l_i}{2^j} \right) & \text{if } l_i < 2^j \\ \left[ \frac{l_i - 1}{2^j}, 1 \right] & \text{if } l_i = 2^j. \end{cases}$$

We then introduce the cube $K_{j,l} = \prod_{i=1}^{2d} I_j(l_i)$ and set $\mathcal{K}_j = \{K_{j,l},\ l \in \mathcal{L}_j\}$.

The algorithm starts with $[0,1]^{2d}$. At each step, it gets a partition of $[0,1]^{2d}$ into a finite
family of disjoint cubes of the form $K_{j,l}$. For any such cube, one decides to divide it into the $4^d$
elements of $\mathcal{K}_{j+1}$ which are contained in it, or not. The set of all such partitions that can be
constructed in less than $\ell$ steps is denoted by $\mathcal{M}_\ell$. We set $\mathcal{M}_\infty = \bigcup_{\ell \geq 1} \mathcal{M}_\ell$. Two examples of
partitions are illustrated in Figure 1 (for $d = 1$).



Figure 1. Left: example of a partition of $\mathcal{M}_2$. Right: example of a partition of $\mathcal{M}_3$.
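
The recursive splitting can be sketched as follows. A dyadic cube is encoded by its level $j$ and a tuple of indices in $\{1, \dots, 2^j\}$, one per coordinate; splitting replaces each index $l_i$ by $2l_i - 1$ or $2l_i$, which gives $4^d$ children in dimension $2d$. The helper names, and the convention of counting partitions by maximal depth of the cubes (which may differ from the paper's count of steps), are assumptions made for illustration.

```python
from itertools import product

def children(j, l):
    """Children of the dyadic cube K_{j,l}: the 2**len(l) cubes of level j + 1 contained in it.
    l is a tuple of indices, each in {1, ..., 2**j}."""
    choices = [(2 * li - 1, 2 * li) for li in l]
    return [(j + 1, child) for child in product(*choices)]

def bounds(j, l):
    """Endpoints (left, right) of the sides of K_{j,l}."""
    return [((li - 1) / 2 ** j, li / 2 ** j) for li in l]

def count_partitions(max_depth, dim):
    """Number of partitions of [0, 1]^dim obtained by recursive dyadic splitting
    when no cube is deeper than max_depth (dim plays the role of 2d)."""
    count = 1                               # depth 0: only the trivial partition
    for _ in range(max_depth):
        count = 1 + count ** (2 ** dim)     # keep the cube, or split it and partition each child
    return count
```

With `dim = 2` (that is, $d = 1$) the counts for depths 1, 2, 3 are 2, 17 and 83522, which illustrates why a selection rule whose cost grows like the square of the number of candidates is not practical for such collections.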



6 MATHIEU SART 

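2.3. The selection rule. Given $\ell \in \mathbb{N}^* \cup \{\infty\}$, the aim of this section is to select an estimator
among the family $\{\hat{s}_m, m \in \mathcal{M}_\ell\}$.

For any $K \in \bigcup_{m \in \mathcal{M}_\ell} m$ and any partition $m' \in \mathcal{M}_\ell$, let $m' \vee K$ be the partition of $K$ defined
by

$$m' \vee K = \left\{ K' \cap K,\ K' \in m',\ K \cap K' \neq \emptyset \right\}.$$

Let $L$ be a positive number and pen be the non-negative map defined by

$$\mathrm{pen}(m' \vee K) = L\, \frac{|m' \vee K| \log n}{n} \quad \text{for all } m' \in \mathcal{M}_\ell \text{ and } K \in \bigcup_{m \in \mathcal{M}_\ell} m.$$

Let us set $\alpha = (1 - 1/\sqrt{2})/2$ and for all $f, f' \in \mathbb{L}^1_+(\mathbb{X}^2, M)$,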

(2) r(/, /') = 2^ E + [/IV^) - y/f{X~yj) di^iy) 



2" i=0 -^^ 
We define $\gamma$ for $m \in \mathcal{M}_\ell$ by

[Kem"''^^i l\K'em' J 

— pen(m' V K)] } + 2pen(m). 
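Finally, we select $\hat{m}$ among $\mathcal{M}_\ell$ as any partition satisfying

(3) $$\gamma(\hat{m}) \leq \inf_{m \in \mathcal{M}_\ell} \gamma(m) + \frac{1}{n}$$

and consider the resulting estimator $\hat{s} = \hat{s}_{\hat{m}}$.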

Remarks. The estimator $\hat{s} = \hat{s}(L, \ell)$ depends on the choices of two quantities $L > 0$ and $\ell \in \mathbb{N}^* \cup \{\infty\}$.
We shall see in the next section that $L$ can be chosen as a universal numerical constant. As
to $\ell$, from a theoretical point of view, it can be chosen as $\ell = \infty$. In practice, we recommend
taking it as large as possible. Nevertheless, the larger $\ell$, the longer it takes to compute the
estimator. A practical algorithm for computing $\hat{m}$ will be detailed in the appendix.

The selection procedure we use may look somewhat unusual. It can be seen as a mix between
a procedure based on a contrast function (which is usually easy to implement) and a procedure
based on a robust test (the functional $T(f, f')$, which can be seen as a robust test between $f$ and $f'$,
will allow us to obtain risk bounds with respect to a Hellinger-type distance). This functional is
inspired by the variational formula for the Hellinger affinity described in Section 2 of Baraud
(2010).
2.4. An oracle inequality. The main result of this section is the following.

Theorem 2. There exists a universal constant $L_0 > 0$ such that, for all $L \geq L_0$, $\ell \in \mathbb{N}^* \cup \{\infty\}$,
the estimator $\hat{s} = \hat{s}(L, \ell)$ satisfies

(4) $$C\, \mathbb{E}\left[ H^2(s\mathbb{1}_A, \hat{s}) \right] \leq \inf_{m \in \mathcal{M}_\ell} \left\{ \mathbb{E}\left[ H^2(s\mathbb{1}_A, V_m) \right] + \mathrm{pen}(m) \right\}$$

where $C$ is a universal positive constant.

In the literature, oracle inequalities with a random quadratic loss for piecewise constant
estimators have been obtained in Lacour (2007) and Akakpo and Lacour (2011). Their procedures
require a priori assumptions on the transition density and the Markov chain whereas ours
requires none (except homogeneity). However, unlike theirs, our risk bound involves an extra
logarithmic term. We do not know whether this term is necessary or not.

In the proof, we obtain an upper bound for $L_0$ which is unfortunately very rough and useless
in practice. It seems difficult to obtain a sharp bound on $L_0$ from the theory and we have instead
carried out a simulation study in order to tune $L_0$ (see Section 3).

2.5. Risk bounds with respect to a deterministic loss. Although the distance $H$ is
natural, we are interested in controlling the risk associated with a deterministic distance. To do so,
we shall make a posteriori assumptions on the Markov chain.

Assumption 1. The sequence $(X_i)_{i \geq 0}$ is stationary and admits a stationary density $\varphi$ with
respect to the Lebesgue measure $\mu$ on $\mathbb{R}^d$. There exists $\kappa_0 > 0$ such that $\varphi(x) \geq \kappa_0$ for all
$x \in [0,1]^d$.

We introduce $\mathbb{L}^1_+([0,1]^{2d}, (\varphi \cdot \mu) \otimes \mu)$, the cone of integrable and non-negative functions on
$[0,1]^{2d}$ with respect to the product measure $(\varphi \cdot \mu) \otimes \mu$. We endow $\mathbb{L}^1_+([0,1]^{2d}, (\varphi \cdot \mu) \otimes \mu)$ with
the distance $h$ defined by

$$\forall f, f' \in \mathbb{L}^1_+([0,1]^{2d}, (\varphi \cdot \mu) \otimes \mu), \quad h^2(f, f') = \frac{1}{2} \int_{[0,1]^{2d}} \left( \sqrt{f(x, y)} - \sqrt{f'(x, y)} \right)^2 \varphi(x)\, dx\, dy.$$

In our results, we shall need the $\beta$-mixing properties of the Markov chain. We set for all $q \in \mathbb{N}^*$

$$\beta_q = \int_{\mathbb{R}^{2d}} \left| s^{(q)}(x, y) - \varphi(y) \right| \varphi(x)\, dx\, dy$$

where $s^{(q)}(x, \cdot)$ is the density of the conditional law $\mathcal{L}(X_q \mid X_0 = x)$ with respect to the Lebesgue
measure. We refer to Doukhan (1994) and Bradley (2005) for more details on the $\beta$-mixing
coefficients.

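Theorem 3. Under Assumption 1, the estimator $\hat{s}$ built in Section 2.3 with $\ell \in \mathbb{N}^*$ and $L \geq L_0$
satisfies

$$C\, \mathbb{E}\left[ h^2(s\mathbb{1}_A, \hat{s}) \right] \leq \inf_{m \in \mathcal{M}_\ell} \left\{ h^2(s\mathbb{1}_A, V_m) + \mathrm{pen}(m) \right\} + \frac{R_n(\ell)}{n}$$

where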

(5) «.(£) = „2-^i„fJexp(-^^) 

and where $C$ is a universal positive constant.
This result is interesting when the remainder term $R_n(\ell)/n$ is small enough, that is when $2^{\ell d}$
is small compared to $n$ and when the sequence $(\beta_q)_{q \geq 1}$ goes to 0 fast enough. More precisely,
$R_n(\ell)$ can be bounded independently of $n$ and $\ell$ whenever $\ell$, $d$, $n$ and the $\beta_q$ coefficients satisfy the
following.

• If the chain is geometrically $\beta$-mixing, that is if there exists $b_1 > 0$ such that $\beta_q \leq e^{-b_1 q}$,
then

Rn{£) < nH^^^^^ 



exp(-6in) + exp (^- ^ j + exp - y — ^ 



In particular, if $\ell$, $d$, $n$ are such that $2^{\ell d} \leq n/\log^3 n$, $R_n(\ell)$ is upper bounded by a constant
depending only on $\kappa_0$, $b_1$.
• If the chain is arithmetically $\beta$-mixing, that is if there exists $b_2 > 0$ such that $\beta_q \leq q^{-b_2}$,
then

Kg 

where $C'(b_2)$ depends only on $b_2$. Consequently, if $2^{\ell d} \leq n^{1-\zeta}/\log n$ and $b_2 \geq 5/\zeta - 4$
for $\zeta \in (0,1)$, $R_n(\ell)$ is upper bounded by a constant depending only on $\kappa_0$, $b_2$.

2.6. Rates of convergence. The aim of this section is to obtain uniform risk bounds over
classes of smooth transition densities for our estimator.

2.6.1. Hölder spaces. Given $\sigma \in (0,1]$, we say that a function $f$ belongs to the Hölder space
$\mathcal{H}^\sigma([0,1]^{2d})$ if there exists $|f|_\sigma \in \mathbb{R}_+$ such that for all $(x_1, \dots, x_{2d}) \in [0,1]^{2d}$ and all $1 \leq j \leq 2d$,
the functions $f_j(\cdot) = f(x_1, \dots, x_{j-1}, \cdot, x_{j+1}, \dots, x_{2d})$ satisfy

$$|f_j(x) - f_j(y)| \leq |f|_\sigma\, |x - y|^\sigma \quad \text{for all } x, y \in [0,1].$$

When the restriction of $\sqrt{s}$ to $A = [0,1]^{2d}$ is Hölderian, we deduce from (4) the following.

Corollary 1. For all $\sigma \in (0,1]$ and $\sqrt{s}|_A \in \mathcal{H}^\sigma([0,1]^{2d})$, the estimator $\hat{s} = \hat{s}(L_0, \infty)$ satisfies

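$$C\, \mathbb{E}\left[ H^2(s\mathbb{1}_A, \hat{s}) \right] \leq \left( d\, |\sqrt{s}|_A|_\sigma \right)^{\frac{2d}{d+\sigma}} \left( \frac{\log n}{n} \right)^{\frac{\sigma}{\sigma+d}} + \frac{\log n}{n}$$

where $C$ is a universal positive constant.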



2.6.2. Besov spaces. A finer way to measure the smoothness of the transition density is to
assume that $\sqrt{s}|_A$ belongs to a Besov space. We refer to Section 3 of DeVore and Yu (1990) for
a definition of this space. We say that the Besov space $\mathcal{B}^\sigma(\mathbb{L}^p([0,1]^{2d}))$ is homogeneous when
$p \geq 2$ and inhomogeneous otherwise. We set for all $p \in (1, +\infty)$ and $\sigma \in (0,1)$,

^(LaO,lJ ))-j^^(L^([o^i]2.)) ifpG[2,+oo), 

and denote by $|\cdot|_{p,\sigma}$ the semi norm of $\mathcal{B}^\sigma(\mathbb{L}^p([0,1]^{2d}))$. We make the following assumption to
deduce from (4) risk bounds over these spaces.

Assumption 2. There exists $\kappa > 0$ such that for all $i \in \{0, \dots, n-1\}$, $X_i$ admits a density $\varphi_i$
with respect to the Lebesgue measure $\mu$ such that $\varphi_i(x) \leq \kappa$ for all $x \in [0,1]^d$.
Note that we do not require that the chain be either stationary or mixing.

Let $(\mathbb{L}^2([0,1]^{2d}, \mu \otimes \mu), d_2)$ be the metric space of square integrable functions on $[0,1]^{2d}$ with
respect to the Lebesgue measure. Under the above assumption, we deduce from (4) that

CE [H'' {sIa, s^)] < Jni ^ !^i^dl {^/sU, Kn) + Lo^^^"" 

When $\sqrt{s}|_A$ belongs to a Besov space, the right-hand side of this inequality can be upper bounded
thanks to the approximation theorems of DeVore and Yu (1990).

Corollary 2. Suppose that Assumption 2 holds. For all $p \in (2d/(d+1), +\infty)$, $\sigma \in (2d(1/p - 1/2)_+, 1)$
and $\sqrt{s}|_A \in \mathcal{B}^\sigma(\mathbb{L}^p([0,1]^{2d}))$, the estimator $\hat{s} = \hat{s}(L_0, \infty)$ satisfies

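(6) $$C\, \mathbb{E}\left[ H^2(s\mathbb{1}_A, \hat{s}) \right] \leq |\sqrt{s}|_A|_{p,\sigma}^{\frac{2d}{d+\sigma}} \left( \frac{\log n}{n} \right)^{\frac{\sigma}{\sigma+d}} + \frac{\log n}{n}$$

where $C > 0$ depends only on $\kappa$, $\sigma$, $d$, $p$.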



More precisely, it is shown in the proof that the estimators $\hat{s} = \hat{s}(L_0, \ell)$ satisfy (6) when $\ell$ is
large enough (when $\ell \geq d^{-1} (\log 2)^{-1} \log n$).
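
Reading the threshold above as $\ell \geq \log n / (d \log 2)$, it is equivalent to $2^{\ell d} \geq n$. A small sketch (hypothetical helper name, integer arithmetic to avoid rounding issues) that returns the smallest admissible $\ell$:

```python
def smallest_depth(n, d):
    """Smallest integer l >= 0 such that 2**(l * d) >= n, i.e. l >= log(n) / (d * log 2)."""
    l = 0
    while 2 ** (l * d) < n:
        l += 1
    return l
```

For instance, with $n = 1000$ observations this gives $\ell = 10$ when $d = 1$ and $\ell = 5$ when $d = 2$.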

Rates of convergence for the deterministic loss $h$ can be established by using Theorem 3
instead of Theorem 2. For instance, if the chain is geometrically $\beta$-mixing, we may choose $\ell$ the
smallest integer larger than $d^{-1} (\log 2)^{-1} \log(n/\log^3 n)$, in which case the estimator $\hat{s} = \hat{s}(L_0, \ell)$
achieves the rate $(\log n/n)^{\sigma/(\sigma+d)}$ over the Besov spaces $\mathcal{B}^\sigma(\mathbb{L}^p([0,1]^{2d}))$, $p \in (2d/(d+1), +\infty)$,
$\sigma \in (\sigma_1(p, d), 1)$ where

(71 (p, d) = ^(-l + 4 (l/p - 1/2)+ + ^Jl + 24 (l/p - 1/2)+ + 16 (l/p - 1/2)^^ . 

If the chain is arithmetically /3- mixing with bq < q^^, choosing £ the smallest integer larger 
than ci^^ (2 log 2)^^ log(n/ logn) allows us to recover the same rate of convergence when a G 
((T2(p, d), 1) where 

a2(p, d) = d ({l/p - 1/2)+ + ^2(l/p-l/2)+ + (l/p-l/2)^^ . 

We refer the reader to Section 6.7 for a proof of these two results. 

In the literature, Lacour (2007) obtained a rate of order $n^{-\sigma/(\sigma+d)}$ over $\mathcal{B}^\sigma(\mathbb{L}^2([0,1]^2))$, which
is slightly faster, but her approach prevents her from dealing with inhomogeneous Besov spaces and
requires the prior knowledge of a suitable upper bound on the supremum norm of $s$. As far as
we know, the rates that have been established in the other papers hold only when $\sigma > 1$.



3. Simulations. 

In this section, we present a simulation study to evaluate the performance of our estimator
in practice. We shall simulate several Markov chains and estimate their transition densities by
using our procedure.
3.1. Examples of Markov chains. We consider Markov chains of the form

$$X_{k+1} = F(X_k, U_k)$$

where $F$ is some known function and where $U_k$ is a random variable independent of $(X_0, \dots, X_k)$.

For the sake of comparison, we begin with examples that have already been considered
in the simulation study of Akakpo and Lacour (2011). In each of these examples, $U_k$ is a standard
Gaussian random variable.

Example 1. $X_{k+1} = 0.5 X_k + (1 + U_k)/4$.

Example 2. $X_{k+1} = 12^{-1} \left( 6 + \sin(12 X_k - 6) + (\cos(X_k - 6) + 3) U_k \right)$.
Example 3. 

Xk+i = ^ (^fc + 1) + - ^ (^/3(5X,/3, 4, 4) + i{5Xi - 2)/3, 400, 400)^ ^ Uk 

where $\beta(\cdot, a, b)$ is the density of the $\beta$ distribution with parameters $a$ and $b$.
Example 4. 

Xk+i = \ (giXk) + 1) + lUk 

where g is defined by 

g[x) = 1^ exp (-18(x - 1/2)2) _^ (-162(a; - 3/4)^) for all x G M. 

At first sight, Examples 1 and 2 may seem to be different from those of Akakpo and Lacour
(2011). Actually, we have just rescaled the data in order to estimate on $[0,1]^2$. The statistical
problem is the same. According to Akakpo and Lacour (2011), we set p large {p = 10^) and 
we estimate the transition densities of Examples 1, 2, 3 and 4 from $(X_p, \dots, X_{n+p})$ so that the
chain is approximately stationary.
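
As an illustration, Example 1 can be simulated with a burn-in period as follows. Since $U_k$ is standard Gaussian, the conditional law of $X_{k+1}$ given $X_k = x$ is Gaussian with mean $0.5x + 0.25$ and standard deviation $1/4$, which gives the transition density in closed form. The function names, the default burn-in length and the starting point are assumptions of this sketch, not values taken from the paper.

```python
import math
import random

def simulate_example1(n, burn_in=1000, x0=0.5, seed=0):
    """Simulate X_{k+1} = 0.5 * X_k + (1 + U_k) / 4 with U_k standard Gaussian,
    discard the first burn_in values and return the next n + 1 values."""
    rng = random.Random(seed)
    x = x0
    path = []
    for k in range(burn_in + n + 1):
        if k >= burn_in:
            path.append(x)
        x = 0.5 * x + (1.0 + rng.gauss(0.0, 1.0)) / 4.0
    return path

def transition_density_example1(x, y):
    """Density at y of the Gaussian law with mean 0.5 * x + 0.25 and standard deviation 1/4."""
    sd = 0.25
    z = (y - (0.5 * x + 0.25)) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
```

The stationary law of this chain is Gaussian with mean $1/2$ and variance $1/12$ (solve $m = 0.5m + 0.25$ and $v = 0.25v + 1/16$), so most of the mass of the chain lies in $[0,1]$.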

We also propose to consider the following examples. In Example 5, $U_k$ is a centred Gaussian
random variable with variance 1/2, in Example 6, $U_k$ admits the density

$$f(x) = \frac{5}{\sqrt{2\pi}} \left[ \exp\left( -50 (x - 1)^2 \right) + \exp\left( -50 x^2 \right) \right]$$

with respect to the Lebesgue measure, and in Example 7, $U_k$ is an exponential random variable
with parameter 1.

Example 5. $X_{k+1} = 0.5 X_k + (1 + U_k)/4$.
Example 6. $X_{k+1} = 0.5 (X_k + U_k)$.
Example 7. $X_{k+1} = X_k/(50 X_k + 1) + X_k U_k$.

We set $X_0 = 1/2$ and estimate $s$ from $(X_0, \dots, X_n)$. These last three Markov chains are not
stationary. Their transition densities are rather isotropic and inhomogeneous. The transition
density of Example 7 is unbounded.

In what follows, our selection rule will always be applied with $L = 0.03$ (whatever $\ell$, $n$ and
the Markov chain).
3.2. Choice of $\ell$. We discuss the choice of $\ell$ by simulating the preceding examples with $n = 10^3$
and by applying our selection rule for each value of $\ell \in \{1, \dots, 10\}$. The results are summarized
below.





  l     Ex 1    Ex 2    Ex 3    Ex 4    Ex 5    Ex 6    Ex 7
  1     0.031   0.046   0.299   0.181   0.089   0.291   0.358
  2     0.011   0.015   0.087   0.107   0.024   0.170   0.241
  3     0.011   0.014   0.026   0.058   0.013   0.067   0.156
  4     0.011   0.018   0.026   0.035   0.015   0.046   0.113
  5     0.011   0.018   0.022   0.038   0.015   0.048   0.098
  6     0.011   0.018   0.022   0.038   0.015   0.048   0.065
  7     0.011   0.018   0.024   0.038   0.015   0.048   0.044
  8     0.011   0.018   0.024   0.038   0.015   0.048   0.040
  9     0.011   0.018   0.024   0.038   0.015   0.048   0.040
 10     0.011   0.018   0.024   0.038   0.015   0.048   0.040

Figure 2. Hellinger risk $H^2(s\mathbb{1}_{[0,1]^2}, \hat{s})$ for each value of $\ell$ (rows) and each example (columns).



As $\ell$ grows, the risk of our estimator tends to decrease and then to stabilize. The best
choice of $\ell$ is obviously unknown in practice but this table shows that a good way of choosing $\ell$
is to take it as large as possible. This is theoretically justified by Theorem 2 since the right-hand
side of inequality (4) is a non-increasing function of $\ell$.

3.3. An illustration. We apply our procedure to Examples 1 and 6 with $n = 10^3$, $\ell = 7$. We
get two estimators and draw them with the corresponding transition densities in Figure 3.




Example 1. Example 6. 

Figure 3. Estimator and transition density. 



This shows that the selected partition is finer (respectively coarser) near the points where the
transition density changes rapidly (respectively slowly), and is thus rather well adapted to
the target function $s$.
3.4. Comparison with other procedures. In this section, we compare our selection rule
with the oracle estimator and with the piecewise constant estimator of Akakpo and Lacour 
(2011). 

The procedure of Akakpo and Lacour (2011) amounts to selecting an estimator among {sm, m G 
A^'} where Sm is defined by (1) and where M' is a collection of irregular partitions on [0, 1]^. 
Precisely, with their notations, we apply it with = 5, pen(m) = 3||su||oo|?Ti|/n and with 
pen(m) = Sp^* ||oo|?7i|/w where m* is a partition suitably chosen (following the recommenda- 
tions of Akakpo and Lacour (2011), that is J, = 3). These two estimators are denoted by s^^^ 
and $\hat{s}^{(2)}$ respectively. Notice that these penalties, which are used in their simulation study, are
not the ones prescribed by their theory. Their theoretical penalties also depend on a positive
lower bound on the stationary density.

We denote by $\hat{s}^{(0)}$ the oracle estimator, that is the estimator defined as a minimizer of
the map $m \mapsto H^2(s\mathbb{1}_{[0,1]^2}, \hat{s}_m)$ for $m \in \mathcal{M}_7$. This estimator is the best estimator of the family
$\{\hat{s}_m, m \in \mathcal{M}_7\}$ and is known since the data are simulated. We consider the random variables

$$\mathcal{R}_i = \frac{H^2(s\mathbb{1}_{[0,1]^2}, \hat{s})}{H^2(s\mathbb{1}_{[0,1]^2}, \hat{s}^{(i)})} \quad \text{for } i = 0, 1, 2$$

and denote by $q_0(\alpha)$ the $\alpha$-quantile of $\mathcal{R}_0$. Results obtained are given in Figure 4.





                                                     Ex 1    Ex 2    Ex 3    Ex 4    Ex 5    Ex 6    Ex 7
$\mathbb{E}[H^2(s\mathbb{1}_{[0,1]^2}, \hat{s})]$            0.011   0.017   0.022   0.038   0.018   0.052   0.049
$\mathbb{E}[H^2(s\mathbb{1}_{[0,1]^2}, \hat{s}^{(0)})]$      0.007   0.011   0.015   0.028   0.012   0.037   0.041
$q_0(0.5)$                                           1.473   1.513   1.443   1.369   1.422   1.420   1.200
$q_0(0.75)$                                          1.698   1.627   1.557   1.440   1.575   1.481   1.244
$q_0(0.9)$                                           1.921   1.834   1.683   1.509   1.749   1.543   1.290
$q_0(0.95)$                                          2.113   1.965           1.558   1.839   1.590   1.317
$\mathbb{E}[H^2(s\mathbb{1}_{[0,1]^2}, \hat{s}^{(1)})]$      0.017   0.018   0.028   0.058   0.024   0.103
$\mathbb{P}(\mathcal{R}_1 \leq 1)$                            0.964   0.740   0.908   1       0.984   1
$\mathbb{E}[H^2(s\mathbb{1}_{[0,1]^2}, \hat{s}^{(2)})]$      0.013   0.018   0.028   0.062   0.023   0.096   0.133
$\mathbb{P}(\mathcal{R}_2 \leq 1)$                            0.832   0.748   0.928   1       0.948   1       1

Figure 4. Risks for simulated data with n = 1000 averaged over 250 samples.



3.5. Comparison with a quadratic empirical risk. In Akakpo and Lacour (2011), the risks
of the estimators are evaluated with an empirical quadratic norm and we can also compare the
performance of our estimator to theirs by using this risk.

To do so, let us denote by $\|\cdot\|_n$ the empirical quadratic norm defined by

$$\|f\|_n^2 = \frac{1}{n} \sum_{i=1}^{n} \int_{\mathbb{R}} f^2(X_i, x)\, dx \quad \text{for all } f \in \mathbb{L}^2(\mathbb{R}^2, M)$$

and set for $i \in \{1, 2\}$,

$$\mathcal{R}'_i = \frac{\|s\mathbb{1}_{[0,1]^2} - \hat{s}\|_n^2}{\|s\mathbb{1}_{[0,1]^2} - \hat{s}^{(i)}\|_n^2}.$$

The results obtained are presented in Figure 5. They are very similar to those of Figure 4.





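                                                         Ex 1    Ex 2    Ex 3    Ex 4    Ex 5    Ex 6    Ex 7
$\mathbb{E}[\|s\mathbb{1}_{[0,1]^2} - \hat{s}\|_n^2]$            0.064   0.108   0.229   0.319   0.116   0.528   2.82
$\mathbb{E}[\|s\mathbb{1}_{[0,1]^2} - \hat{s}^{(1)}\|_n^2]$      0.147   0.133   0.257   0.423   0.205   0.743
$\mathbb{P}(\mathcal{R}'_1 \leq 1)$                               0.980   0.820   0.788   0.984   0.992   1
$\mathbb{E}[\|s\mathbb{1}_{[0,1]^2} - \hat{s}^{(2)}\|_n^2]$      0.091   0.129   0.262   0.418   0.159   0.739   6.08
$\mathbb{P}(\mathcal{R}'_2 \leq 1)$                               0.864   0.780   0.792   0.980   0.940   1       1

Figure 5. Risks for simulated data with n = 1000 averaged over 250 samples.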



4. A GENERAL PROCEDURE. 

In Section 2, we used our selection rule to establish the oracle inequality (4), from which we
deduced rates of convergence over Besov spaces $\mathcal{B}^\sigma(\mathbb{L}^p([0,1]^{2d}))$ with $\sigma$ smaller than 1. We now
aim at obtaining rates for more general spaces of functions. This includes Besov spaces with
regularity index larger than 1 and spaces corresponding to structural assumptions on $s$. We
propose a second procedure to reach this goal.

The Markov chain takes its values in $\mathbb{X}$ and we estimate $s$ on a subset $A$ of the form
$A = A_1 \times A_2$. We always assume that $n \geq 3$.



4.1. Procedure and preliminary result. Our second procedure is defined as follows. Let
$\alpha = (1 - 1/\sqrt{2})/2$, $L > 0$, $S$ be an at most countable subset of $\mathbb{L}^1_+(\mathbb{X}^2, M)$ and $\Delta_S \geq 1$ be a map
on $S$.



We define the application p on 5 by 



aH\f,f')+Tif,f')-L^^ 



pU) = sup 

/'65 L 



We select s among S as any element of S satisfying 



+ for all / G S. 



n 



p(S)<mfrt/) + -. 



We prove the following. 



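Proposition 4. Suppose that $f(x) = 0$ for all $f \in S$ and $x \in \mathbb{X}^2 \setminus A$ and that $\sum_{f \in S} e^{-\Delta_S(f)} \leq 1$.
There exists a universal constant $L_0 > 0$ such that if $L \geq L_0$, the estimator $\hat{s}$ satisfies

(7) $$C\, \mathbb{E}\left[ H^2(s\mathbb{1}_A, \hat{s}) \right] \leq \mathbb{E}\left[ \inf_{f \in S} \left\{ H^2(s\mathbb{1}_A, f) + L\, \frac{\Delta_S(f)}{n} \right\} \right]$$

where $C$ is a universal positive constant.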
4.2. A general model selection theorem. We shall deduce from the above proposition a
model selection theorem by choosing $S$ suitably. To do so, we consider the following assumption.

Assumption 3. For all $i \in \{1, \dots, n-1\}$, $X_i$ admits a density $\varphi_i$ with respect to some known
measure $\nu$ such that $\nu(A_1) = 1$. Moreover, there exists $\kappa$ such that $\varphi_i(x) \leq \kappa$ for all $x \in A_1$ and
$i \in \{1, \dots, n-1\}$.

We define $\mathbb{L}^2(A, \nu \otimes \mu)$, the space of square integrable functions on $A$ with respect to the
product measure $\nu \otimes \mu$, and we endow it with its natural distance

$$d^2(f, f') = \int_A \left( f(x, y) - f'(x, y) \right)^2 d\nu(x)\, d\mu(y) \quad \text{for all } f, f' \in \mathbb{L}^2(A, \nu \otimes \mu).$$

Hereafter, a model $V$ is a (non-trivial) finite dimensional linear subspace of $\mathbb{L}^2(A, \nu \otimes \mu)$.

Let us explain how to obtain a model selection theorem when Assumption 3 holds. Let V 
be a collection of models V and let (A(F))ygv be a family of non- negative numbers such that 
Eveve"^^^^ < 1- For each model F G V, we consider an orthonormal basis (/i, . . . , /dimv) 
of V and set 

V n dim V 



{diraV 
j=l 



We deduce from Lemma 5 of Birgé (2006) that the cardinal of $S_V = \{f \in T_V,\ d(f, 0) \leq 2\}$ is upper bounded by $|S_V| \leq (30n)^{\dim V/2}$. We then use the above procedure with $S = \cup_{V \in \mathbb{V}} S_V$ and

$$\Delta_S(f) = \inf_{V \in \mathbb{V},\ S_V \ni f} \left\{\Delta(V) + (\dim V) \log(30n)/2\right\} \quad \text{for all } f \in S.$$

This yields an estimator $\hat s$ such that



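$$C'\,\mathbb{E}\left[H^2(s\mathbb{1}_A, \hat s)\right] \leq \inf_{V \in \mathbb{V}} \left\{ \inf_{f \in T_V,\ d(f,0) \leq 2} d^2\left(\sqrt{s}|_A, f\right) + \frac{\Delta(V) + \dim(V) \log n}{n} \right\}$$

where $C'$ is a universal positive constant. Since $d(\sqrt{s}|_A, 0) \leq 1$,

$$\inf_{f \in T_V,\ d(f,0) \leq 2} d^2\left(\sqrt{s}|_A, f\right) = d^2\left(\sqrt{s}|_A, T_V\right).$$

For all $f \in V$, there exists $f' \in T_V$ such that $d^2(f, f') \leq n^{-1}$, so that $T_V$ may be replaced by $V$ in the bound above up to an additive term of order $n^{-1}$.
Precisely, we have proved: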

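Theorem 5. Suppose that Assumption 3 holds. Let $\mathbb{V}$ be an at most countable collection of models. Let $(\Delta(V))_{V \in \mathbb{V}}$ be a family of non-negative numbers such that

$$\sum_{V \in \mathbb{V}} e^{-\Delta(V)} \leq 1.$$

There exists an estimator $\hat s$ such that

$$C\,\mathbb{E}\left[H^2(s\mathbb{1}_A, \hat s)\right] \leq \inf_{V \in \mathbb{V}} \left\{ d^2\left(\sqrt{s}|_A, V\right) + \frac{\Delta(V) + \dim(V)\log n}{n} \right\}$$

where $C > 0$ depends only on $\kappa$.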

The condition $\sum_{V \in \mathbb{V}} e^{-\Delta(V)} \leq 1$ can be interpreted as a (sub)probability on the collection $\mathbb{V}$. The more complex the family $\mathbb{V}$, the larger the weights $\Delta(V)$. When one can choose $\Delta(V)$ of order $\dim(V)$, which means that the family $\mathbb{V}$ of models does not contain too many models per dimension, the estimator $\hat s$ achieves the best trade-off (up to a constant) between the approximation and the variance terms.
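The role of the weights can be illustrated numerically. The sketch below assumes a hypothetical collection with exactly one model per dimension $D = 1, \dots, 50$ and the choice $\Delta(V) = \dim(V)$; it checks the (sub)probability condition and evaluates the term $(\Delta(V) + \dim(V)\log n)/n$. The function names are illustrative only.

```python
import math


def kraft_sum(weights):
    """Sum of exp(-Delta(V)) over the collection; the condition asks for a value at most 1."""
    return sum(math.exp(-w) for w in weights)


def complexity_term(delta_v, dim_v, n):
    """The term (Delta(V) + dim(V) * log n) / n attached to a model V."""
    return (delta_v + dim_v * math.log(n)) / n


# Hypothetical collection: one model per dimension, with Delta(V) = dim(V).
dims = list(range(1, 51))
weights = [float(D) for D in dims]
total = kraft_sum(weights)

# Complexity terms for n = 1000 observations.
terms = [complexity_term(w, D, 1000) for w, D in zip(weights, dims)]
```

With this choice the sum of the $e^{-\Delta(V)}$ is a geometric series bounded by $1/(e-1)$, so the condition holds and the complexity term grows linearly with the dimension.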

This theorem holds under an assumption that is very mild and weaker than those of Lacour (2007), Akakpo and Lacour (2011) and Clémençon (2000). Birgé (2012) proved a general oracle inequality when there exist integers $k \geq 1$ and $l \geq 0$ and positive numbers $\rho$, $\varrho$ such that

$$\varrho \leq \frac{1}{k} \sum_{i=1}^{k} s^{(l+i)}(x, y) \leq \rho \quad \text{for all } x, y \in \mathbb{X}$$

where the parameters $k, l, \varrho$ are known. Our assumption is then satisfied for the Markov chain $(X_{l+1}, \dots, X_n)$ with $\nu = \mu$ and $\kappa = k\rho$.

We shall consider subsets $\mathscr{F} \subset \mathbb{L}^2(A, \nu \otimes \mu)$ corresponding to smoothness or structural assumptions on $\sqrt{s}|_A$. For such an $\mathscr{F}$, we associate a collection $\mathbb{V}$ and deduce from Theorem 5 a risk bound for the estimator $\hat s$ when $\sqrt{s}|_A$ belongs to $\mathscr{F}$. This set $\mathscr{F}$ is a generic notation and will change from section to section. In the remaining part of this paper, we shall always choose $\mathbb{X}^2 = \mathbb{R}^{2d}$, $A = [0,1]^{2d}$ and $\mu$ the Lebesgue measure.

4.3. Smoothness assumptions. We have introduced in Section 2.6 the isotropic Besov spaces $\mathscr{B}^{\sigma}_q(\mathbb{L}^p([0,1]^{2d}))$ where $\sigma \in (0,1)$. In this section, we consider the anisotropic Besov spaces $\mathscr{B}^{\boldsymbol{\sigma}}_q(\mathbb{L}^p([0,1]^{2d}))$ where $\boldsymbol{\sigma} = (\sigma_1, \dots, \sigma_{2d})$ belongs to $(0,+\infty)^{2d}$.

Intuitively, a function $f$ on $[0,1]^{2d}$ belongs to $\mathscr{B}^{\boldsymbol{\sigma}}_q(\mathbb{L}^p([0,1]^{2d}))$ if, for all $j \in \{1, \dots, 2d\}$ and $x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_{2d} \in [0,1]$, the function

$$x_j \mapsto f(x_1, \dots, x_{j-1}, x_j, x_{j+1}, \dots, x_{2d})$$

belongs to $\mathscr{B}^{\sigma_j}_q(\mathbb{L}^p([0,1]))$. In particular, for all $\sigma \in (0,+\infty)$,

$$\mathscr{B}^{\sigma}_q(\mathbb{L}^p([0,1]^{2d})) = \mathscr{B}^{(\sigma, \dots, \sigma)}_q(\mathbb{L}^p([0,1]^{2d})).$$

A definition of the anisotropic Besov spaces may be found in Hochmuth (2002) (for $d = 1$) and in Akakpo (2009) (for larger values of $d$). We also consider the space $\mathcal{H}^{\boldsymbol{\sigma}}([0,1]^{2d})$ of anisotropic Hölderian functions on $[0,1]^{2d}$ with regularity $\boldsymbol{\sigma}$. A precise definition of this space may be found in Section 3.1.1 of Baraud and Birgé (2011) (among other references).

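For all $\boldsymbol{\sigma} = (\sigma_1, \dots, \sigma_{2d}) \in (0,+\infty)^{2d}$, we denote by $\bar\sigma$ the harmonic mean of $\boldsymbol{\sigma}$:

$$\frac{1}{\bar\sigma} = \frac{1}{2d} \sum_{i=1}^{2d} \frac{1}{\sigma_i}.$$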
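We set for all $p \in (0,+\infty]$,

$$\mathscr{B}^{\boldsymbol{\sigma}}(\mathbb{L}^p([0,1]^{2d})) = \begin{cases} \mathscr{B}^{\boldsymbol{\sigma}}_{\infty}(\mathbb{L}^p([0,1]^{2d})) & \text{if } p \in (0,1] \\ \mathscr{B}^{\boldsymbol{\sigma}}_{p}(\mathbb{L}^p([0,1]^{2d})) & \text{if } p \in (1,2) \\ \mathscr{B}^{\boldsymbol{\sigma}}_{\infty}(\mathbb{L}^p([0,1]^{2d})) & \text{if } p \in [2,+\infty) \end{cases}$$

and denote by $|\cdot|_{p,\boldsymbol{\sigma}}$ the semi norm associated to the space $\mathscr{B}^{\boldsymbol{\sigma}}(\mathbb{L}^p([0,1]^{2d}))$.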

In this section, we are interested in obtaining a risk bound when $\sqrt{s}|_A$ belongs to the space

$$\mathscr{B}([0,1]^{2d}) = \bigcup_{p \in (0,+\infty]} \left( \bigcup_{\substack{\boldsymbol{\sigma} \in (0,+\infty)^{2d} \\ \bar\sigma > 2d(1/p - 1/2)_+}} \mathscr{B}^{\boldsymbol{\sigma}}(\mathbb{L}^p([0,1]^{2d})) \right).$$

Families of linear spaces possessing good approximation properties with respect to the elements of $\mathscr{F} = \mathscr{B}([0,1]^{2d})$ can be found in Theorem 1 of Akakpo (2012). We then deduce from Theorem 5,

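Corollary 3. Suppose that Assumption 3 holds with $\mathbb{X} = \mathbb{R}^d$, $A = [0,1]^{2d}$ and with $\nu \otimes \mu$ the Lebesgue measure. There exists an estimator $\hat s$ such that for all $\sqrt{s}|_A \in \mathscr{B}([0,1]^{2d})$,

$$C\,\mathbb{E}\left[H^2(s\mathbb{1}_A, \hat s)\right] \leq \left|\sqrt{s}|_A\right|_{p,\boldsymbol{\sigma}}^{2d/(d+\bar\sigma)} \left(\frac{\log n}{n}\right)^{2\bar\sigma/(2\bar\sigma+2d)} + \frac{\log n}{n}$$

where $p \in (0,+\infty]$, $\boldsymbol{\sigma} \in (0,+\infty)^{2d}$, $\bar\sigma > 2d(1/p - 1/2)_+$ are such that $\sqrt{s}|_A \in \mathscr{B}^{\boldsymbol{\sigma}}(\mathbb{L}^p([0,1]^{2d}))$ and where $C > 0$ depends only on $\kappa, d, p, \boldsymbol{\sigma}$.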



To our knowledge, the only statistical procedures that can adapt both to possible inhomogeneity and anisotropy of $s$ are those of Akakpo and Lacour (2011) and Birgé (2012). The losses are different, but the rates are the same as ours (up to the logarithmic term). In view of our assumptions, we do not know if the logarithmic term can be avoided.
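As a small numerical companion to the anisotropic setting, the sketch below computes the harmonic mean $\bar\sigma$ of a regularity vector, checks the constraint $\bar\sigma > 2d(1/p - 1/2)_+$, and evaluates the exponent $2\bar\sigma/(2\bar\sigma + 2d)$. This exponent is the usual one for a function of $2d$ variables and is taken here as an assumption; the function names are illustrative.

```python
def harmonic_mean(sigma):
    """Harmonic mean of the regularity vector sigma = (sigma_1, ..., sigma_2d)."""
    return len(sigma) / sum(1.0 / s for s in sigma)


def admissible(sigma, p):
    """Check the constraint sbar > 2d * (1/p - 1/2)_+ with 2d = len(sigma)."""
    inv_p = 0.0 if p == float("inf") else 1.0 / p
    return harmonic_mean(sigma) > len(sigma) * max(inv_p - 0.5, 0.0)


def rate_exponent(sigma):
    """Exponent 2*sbar / (2*sbar + 2d), assumed form of the rate in dimension 2d."""
    sbar = harmonic_mean(sigma)
    return 2 * sbar / (2 * sbar + len(sigma))


# Case d = 1: isotropic regularity (1, 1) versus anisotropic regularity (0.5, 2).
iso = rate_exponent([1.0, 1.0])
aniso = rate_exponent([0.5, 2.0])
```

The harmonic mean is dominated by the smallest coordinate, so a single rough direction lowers the exponent even when the other direction is smooth.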

In the following sections, we consider classes $\mathscr{F}$ corresponding to structural assumptions on $\sqrt{s}|_A$. More precisely, rates of convergence when the chain is autoregressive with constant conditional variance (respectively non constant conditional variance) are established in Section 4.4 (respectively Section 4.5).



4.4. AR model. In this section, we assume that $X_{n+1} = g(X_n) + \varepsilon_n$ where $g$ is an unknown function and where the $\varepsilon_n$'s are unobserved identically distributed random variables. Many papers are devoted to the estimation of the regression function $g$ and it is beyond the scope of this paper to make a historical review of this statistical problem.
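For readers who want to experiment with this model, the following sketch simulates a chain of the form $X_{n+1} = g(X_n) + \varepsilon_n$. The regression function $g(x) = 0.5x$ and the standard Gaussian noise are arbitrary illustrative choices, not those of the paper, and `simulate_ar` is a hypothetical helper name.

```python
import random


def simulate_ar(n, g, noise, x0=0.0):
    """Return [X_0, ..., X_n] with X_{k+1} = g(X_k) + noise()."""
    chain = [x0]
    for _ in range(n):
        chain.append(g(chain[-1]) + noise())
    return chain


# Illustrative choices: linear g and independent standard Gaussian innovations.
rng = random.Random(0)
chain = simulate_ar(1000, g=lambda x: 0.5 * x, noise=lambda: rng.gauss(0.0, 1.0))

# The estimation procedures work with the consecutive pairs (X_i, X_{i+1}).
pairs = list(zip(chain[:-1], chain[1:]))
```

Only the pairs $(X_i, X_{i+1})$, $i = 0, \dots, n-1$, enter the construction of an estimator of the transition density.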

For the sake of simplicity, one shall assume throughout this section that $\mathbb{X} = \mathbb{R}$, $A = [0,1]^2$. The transition density is of the form $s(x,y) = \varphi(y - g(x))$ where $\varphi$ is the density of $\varepsilon_0$. Since $g$ and $\varphi$ are both unknown, this suggests considering the class

$$\mathscr{F} = \bigcup_{\sigma > 0} \left\{ f,\ \exists \phi \in \mathcal{H}^{\sigma}(\mathbb{R}),\ \exists g \in \mathscr{B}([0,1]),\ \|g\|_\infty < \infty,\ \forall x, y \in [0,1],\ f(x,y) = \phi(y - g(x)) \right\}.$$
A family $\mathbb{V}$ of linear spaces possessing good approximation properties with respect to the functions of $\mathscr{F}$ can be built by using Section 6.2 of Baraud and Birgé (2011). Precisely, we prove the following.

Corollary 4. Suppose that Assumption 3 holds with $\mathbb{X} = \mathbb{R}$, $A = [0,1]^2$ and with $\nu \otimes \mu$ the Lebesgue measure on $\mathbb{R}^2$. Assume that $\sqrt{s}|_A$ belongs to $\mathscr{F}$. Let $\sigma > 0$, $p \in (0,+\infty]$, $\beta > (1/p - 1/2)_+$ be any numbers and $\phi \in \mathcal{H}^{\sigma}(\mathbb{R})$, $g \in \mathscr{B}^{\beta}(\mathbb{L}^p([0,1]))$, $\|g\|_\infty < \infty$ be any functions such that

$$\sqrt{s(x,y)} = \phi(y - g(x)) \quad \text{for all } x, y \in [0,1].$$

There exist two estimators $\hat\phi$ and $\hat g$ such that the estimator $\hat s$ defined by

$$\hat s(x,y) = \left(\hat\phi(y - \hat g(x))\right)^2 \mathbb{1}_{[0,1]^2}(x,y) \quad \text{for all } x, y \in \mathbb{R}$$



satisfies 



where $C > 0$ depends only on $\kappa, p, \sigma, \beta$, where $C_1$ depends only on $p, \beta, \sigma, |g|_{p,\beta}, \|g\|_\infty, |\phi|_{\infty,\sigma \wedge 1}$ and where $C_2$ depends only on $\sigma, \|g\|_\infty, |\phi|_{\infty,\sigma}$. Moreover, the construction of the estimators $\hat g$, $\hat\phi$ depends only on the data $X_0, \dots, X_n$.

In particular, if $\phi$ is very smooth (say $\sigma > \beta \vee 1$), the rate of convergence corresponds to the rate of convergence for estimating $g$ only (up to a logarithmic term).

It is interesting to compare the preceding rate to the one we would obtain under the pure smoothness assumption on $\sqrt{s}|_A$ but ignoring that $\sqrt{s}|_A$ belongs to $\mathscr{F}$. To do so, we need to specify the regularity of $\sqrt{s}|_A$, knowing that of $\phi$ and $g$. This is the purpose of the following lemma.

Lemma 1. Let $\sigma, \beta > 0$, and let us define

$$\theta(\beta, \sigma) = \begin{cases} \beta\sigma & \text{if } \beta, \sigma \leq 1 \\ \beta \wedge \sigma & \text{otherwise.} \end{cases}$$

Let $\phi \in \mathcal{H}^{\sigma}(\mathbb{R})$, $g \in \mathcal{H}^{\beta}([0,1])$. The function $f$ defined by

$$f(x,y) = \phi(y - g(x)) \quad \text{for all } x, y \in [0,1],$$

belongs to $\mathcal{H}^{(\theta(\beta,\sigma),\sigma)}([0,1]^2)$.

Moreover, for all $\sigma, \beta > 0$, there exist $\phi \in \mathcal{H}^{\sigma}(\mathbb{R})$, $g \in \mathcal{H}^{\beta}([0,1])$ such that the function $f$ defined by

$$f(x,y) = \phi(y - g(x)) \quad \text{for all } x, y \in [0,1],$$

belongs to $\mathcal{H}^{(a,b)}([0,1]^2)$ if and only if $a \leq \theta(\beta,\sigma)$ and $b \leq \sigma$.

This result says that if $\sqrt{s(x,y)} = \phi(y - g(x))$, with $\phi \in \mathcal{H}^{\sigma}(\mathbb{R})$, $g \in \mathcal{H}^{\beta}([0,1])$, then $\sqrt{s}$ is Hölderian with regularity $(\theta(\beta,\sigma), \sigma)$ on $[0,1]^2$, and this regularity cannot be improved in general except in some particular situations. Under such a smoothness assumption, the rate of estimation we would get is $(\log n/n)^{2\bar\theta/(2\bar\theta+2)}$, where $\bar\theta$ denotes the harmonic mean of $(\theta(\beta,\sigma), \sigma)$. This rate is always slower than the rate obtained under the structural assumption.
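The comparison can be checked on numbers. The sketch below evaluates $\theta(\beta, \sigma)$ from Lemma 1 and the exponent obtained under the pure smoothness assumption, and compares it with $2\beta/(2\beta+1)$, which is assumed here to be the exponent for estimating a univariate function of regularity $\beta$ (the situation described above when $\sigma > \beta \vee 1$). All function names are illustrative.

```python
def theta(beta, sigma):
    """Regularity index theta(beta, sigma) of Lemma 1."""
    if beta <= 1 and sigma <= 1:
        return beta * sigma
    return min(beta, sigma)


def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)


def smoothness_exponent(beta, sigma):
    """Exponent 2*m / (2*m + 2), with m the harmonic mean of (theta(beta, sigma), sigma)."""
    m = harmonic_mean([theta(beta, sigma), sigma])
    return 2 * m / (2 * m + 2)


def regression_exponent(beta):
    """Assumed exponent 2*beta / (2*beta + 1) for estimating g alone."""
    return 2 * beta / (2 * beta + 1)


# Example with beta = 1 and sigma = 2, so that sigma > max(beta, 1).
smooth = smoothness_exponent(1.0, 2.0)
structural = regression_exponent(1.0)
```

In this example the smoothness exponent equals $4/7$ while the structural one equals $2/3$: a larger exponent means a faster rate, so ignoring the structure costs a slower rate.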
4.5. ARCH model. Throughout this section, we assume that $X_{n+1} = g_1(X_n) + g_2(X_n)\varepsilon_n$ where $g_1, g_2$ are unknown functions and where the $\varepsilon_n$'s are unobserved identically distributed random variables. The previous model corresponded to $g_2 = 1$. The problem of the estimation of the mean and variance functions $g_1$ and $g_2$ was considered in several papers and we refer to Section 1.2 of Comte and Rozenholc (2002) for bibliographical references.

For the sake of simplicity, one assumes that $\mathbb{X} = \mathbb{R}$ and $A = [0,1]^2$. If $\varphi$ denotes the density of $\varepsilon_0$, the transition density $s$ is of the form

$$s(x,y) = |g_2(x)|^{-1} \varphi\left[g_2^{-1}(x)(y - g_1(x))\right] \quad \text{for all } x, y \in \mathbb{R}. \tag{8}$$
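Formula (8) is a change of variables, which can be sanity-checked numerically: for fixed $x$, the function $y \mapsto s(x,y)$ should integrate to 1. The sketch below does this with a Riemann sum, assuming a standard Gaussian noise density and arbitrary illustrative functions $g_1$, $g_2$; none of these choices come from the paper.

```python
import math


def gaussian_pdf(z):
    """Standard Gaussian density, used here as an illustrative noise density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def transition_density(x, y, g1, g2, noise_pdf):
    """Transition density of formula (8): |g2(x)|^{-1} * pdf((y - g1(x)) / g2(x))."""
    scale = g2(x)
    return noise_pdf((y - g1(x)) / scale) / abs(scale)


# Illustrative mean and (non-vanishing) scale functions.
g1 = lambda x: 0.5 * x
g2 = lambda x: math.sqrt(0.1 + 0.5 * x * x)

# Riemann sum of y -> s(1, y) over [-10, 10] with step 0.001.
step = 0.001
grid = [-10.0 + step * k for k in range(20001)]
mass = step * sum(transition_density(1.0, y, g1, g2, gaussian_pdf) for y in grid)
```

The computed mass is numerically equal to 1, as expected for a conditional density.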

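We consider thus the class

$$\mathscr{F} = \bigcup_{\sigma > 0} \Big\{ f,\ \exists \phi \in \mathcal{H}^{\sigma}(\mathbb{R}),\ \exists v_1, v_2 \in \mathscr{B}([0,1]),\ \|v_1\|_\infty < \infty,\ \|v_2\|_\infty < \infty,\ \forall x, y \in [0,1],\ f(x,y) = \sqrt{|v_2(x)|}\,\phi\big(v_2(x)(y - v_1(x))\big) \Big\}$$

and apply Theorem 5 with a suitable collection $\mathbb{V}$ to obtain: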

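Corollary 5. Suppose that Assumption 3 holds with $\mathbb{X} = \mathbb{R}$, $A = [0,1]^2$ and with $\nu \otimes \mu$ the Lebesgue measure on $\mathbb{R}^2$. Assume that $\sqrt{s}|_A$ belongs to $\mathscr{F}$. Let $\sigma > 0$, $\phi \in \mathcal{H}^{\sigma}(\mathbb{R})$ and for all $i \in \{1,2\}$, let $p_i \in (0,+\infty]$, $\beta_i > (1/p_i - 1/2)_+$, $v_i \in \mathscr{B}^{\beta_i}(\mathbb{L}^{p_i}([0,1]))$, with $\|v_i\|_\infty < \infty$ such that

$$\sqrt{s(x,y)} = \sqrt{|v_2(x)|}\,\phi\big(v_2(x)(y - v_1(x))\big) \quad \text{for all } x, y \in [0,1].$$

Let $p_3 \in (0,+\infty]$ and $\beta_3 > (1/p_3 - 1/2)_+$ be any numbers such that $v_3 = \sqrt{|v_2|} \in \mathscr{B}^{\beta_3}(\mathbb{L}^{p_3}([0,1]))$. There exists an estimator $\hat s$ such that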

OE [H' I)] < c; C??^) """"" + (f^) 

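where $\beta = \max(\beta_1, \beta_2, \beta_3)$. The constant $C > 0$ depends only on $\kappa, \sigma, p_1, p_2, p_3, \beta_1, \beta_2, \beta_3$, $C_1$ depends only on $\sigma, \|v_1\|_\infty, \|v_2\|_\infty, \|v_3\|_\infty, |v_1|_{p_1,\beta_1}, |v_2|_{p_2,\beta_2}, |v_3|_{p_3,\beta_3}, |\phi|_{\infty,\sigma \wedge 1}$ and $C_2$ depends only on $\sigma, \|v_2\|_\infty, |\phi|_{\infty,\sigma}$. Moreover, the construction of the estimator $\hat s$ depends only on the data $X_0, \dots, X_n$.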


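If $s$ is of the form (8) with $\varphi$, $g_1$, $g_2$ smooth, in the sense that $\phi = \sqrt{\varphi} \in \mathcal{H}^{\sigma}(\mathbb{R})$, $v_1 = g_1 \in \mathscr{B}^{\beta_1}(\mathbb{L}^{p_1}([0,1]))$, $\|v_1\|_\infty < \infty$, $v_2 = g_2^{-1} \in \mathscr{B}^{\beta_2}(\mathbb{L}^{p_2}([0,1]))$, $\|v_2\|_\infty < \infty$ and $v_3 = |g_2|^{-1/2} \in \mathscr{B}^{\beta_3}(\mathbb{L}^{p_3}([0,1]))$, then $\sqrt{s}|_A$ belongs to $\mathscr{F}$. If $\phi$ is sufficiently smooth ($\sigma > \beta_1 \vee \beta_2 \vee \beta_3 \vee 1$), the rate becomes

Up to a logarithmic term, the first term corresponds to the bound we would get if we could estimate $g_1$ only. The two other terms correspond to the rate of estimation of $g_2^{-1}$ and $|g_2|^{-1/2}$ respectively (up to a logarithmic term).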
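Note that if $\beta_2 \in (0,1)$, one can always choose $p_3 = 2p_2$ (with $p_3 = \infty$ if $p_2 = \infty$), $\beta_3 = \beta_2/2$, in which case the rate becomes

$$C'\,\mathbb{E}\left[H^2(s\mathbb{1}_A, \hat s)\right] \leq \max\left\{ \left(\frac{\log^2 n}{n}\right)^{\frac{2\beta_1}{2\beta_1+1}},\ \left(\frac{\log^2 n}{n}\right)^{\frac{\beta_2}{\beta_2+1}} \right\}.$$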

In some situations however, $\beta_3$ can be taken larger than $\beta_2/2$.

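As in the preceding section, we may use the lemma below to compare this rate with the one we would obtain under smoothness assumptions on $\sqrt{s}|_A$.

Lemma 2. Let for all $\sigma, \beta_1, \beta_2 > 0$,

$$\theta(\beta_1, \beta_2, \sigma) = \begin{cases} \left(2^{-1}(\beta_2 \wedge 1)\right) \wedge \sigma\beta_1 \wedge \sigma\beta_2 & \text{if } \sigma \leq 1 \text{ and } \beta_1 \wedge \beta_2 \leq 1 \\ \left(2^{-1}(\beta_2 \wedge 1)\right) \wedge \sigma \wedge \beta_1 & \text{otherwise.} \end{cases}$$

Let $\phi \in \mathcal{H}^{\sigma}(\mathbb{R})$, $v_1 \in \mathcal{H}^{\beta_1}([0,1])$, $v_2 \in \mathcal{H}^{\beta_2}([0,1])$. The function $f$ defined by

$$f(x,y) = \sqrt{|v_2(x)|}\,\phi\big(v_2(x)(y - v_1(x))\big) \quad \text{for all } x, y \in [0,1],$$

belongs to $\mathcal{H}^{(\theta(\beta_1,\beta_2,\sigma),\sigma)}([0,1]^2)$.

Moreover, there exist $\phi \in \mathcal{H}^{\sigma}(\mathbb{R})$, $v_1 \in \mathcal{H}^{\beta_1}([0,1])$, $v_2 \in \mathcal{H}^{\beta_2}([0,1])$ such that the function $f$ defined by

$$f(x,y) = \sqrt{|v_2(x)|}\,\phi\big(v_2(x)(y - v_1(x))\big) \quad \text{for all } x, y \in [0,1],$$

belongs to $\mathcal{H}^{(a,b)}([0,1]^2)$ if and only if $a \leq \theta(\beta_1, \beta_2, \sigma)$ and $b \leq \sigma$.

This lemma says that if $\sqrt{s(x,y)} = \sqrt{|v_2(x)|}\,\phi(v_2(x)(y - v_1(x)))$, with $\phi \in \mathcal{H}^{\sigma}(\mathbb{R})$, $v_1 \in \mathcal{H}^{\beta_1}([0,1])$, $v_2 \in \mathcal{H}^{\beta_2}([0,1])$, then $\sqrt{s}|_A$ belongs to $\mathcal{H}^{(\theta(\beta_1,\beta_2,\sigma),\sigma)}([0,1]^2)$ and the regularity index of this space cannot be increased in general. By Corollary 3, we would get a rate of order $(\log n/n)^{2\bar\theta/(2\bar\theta+2)}$, where $\bar\theta$ denotes the harmonic mean of $(\theta(\beta_1,\beta_2,\sigma), \sigma)$, which is slower than the one given by Corollary 5.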
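The index of Lemma 2 and the simplified bound stated above for $\beta_2 \in (0,1)$ (with $\beta_3 = \beta_2/2$) can be evaluated as follows. This is only a numerical illustration; constants are ignored and the function names are hypothetical.

```python
import math


def theta3(beta1, beta2, sigma):
    """Regularity index theta(beta1, beta2, sigma) of Lemma 2."""
    half = 0.5 * min(beta2, 1.0)
    if sigma <= 1 and min(beta1, beta2) <= 1:
        return min(half, sigma * beta1, sigma * beta2)
    return min(half, sigma, beta1)


def structural_bound(beta1, beta2, n):
    """max of (log^2 n / n)^(2*b1/(2*b1+1)) and (log^2 n / n)^(b2/(b2+1)), constants ignored."""
    r = math.log(n) ** 2 / n
    return max(r ** (2 * beta1 / (2 * beta1 + 1)), r ** (beta2 / (beta2 + 1)))


# Example: beta1 = 1, beta2 = 0.5, sigma = 2 and n = 1000 observations.
index = theta3(1.0, 0.5, 2.0)
bound = structural_bound(1.0, 0.5, 1000)
```

In this example the first coordinate of the Hölder regularity is only $0.25$, which shows how the factor $\sqrt{|v_2|}$ limits the smoothness of $\sqrt{s}$ even when $\phi$ is smooth.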

5. Appendix: implementation of the first procedure. 

In this section, we explain how to construct in practice the estimator of the first procedure. 
This will lead to the proposition below. 

Proposition 6. For all $L > 0$, $\ell \in \mathbb{N}^*$, the estimator $\hat s = \hat s(L,\ell)$ of Section 2.3 can be built in
less than C [nid + £4^^+'^^'^'^ operations where C is an universal constant. 

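We set for all $K \in \cup_{m \in \mathcal{M}_\ell} m$,

$$\hat s_K = \frac{\sum_{i=0}^{n-1} \mathbb{1}_K(X_i, X_{i+1})}{\sum_{i=0}^{n-1} \int_{\mathbb{X}} \mathbb{1}_K(X_i, x)\, d\mu(x)}\,\mathbb{1}_K,$$

for all $K' \in \cup_{m \in \mathcal{M}_\ell} m$,

$$F_K(K') = \alpha H^2\left(\hat s_K \mathbb{1}_{K'}, \hat s_{K'} \mathbb{1}_K\right) + T\left(\hat s_K \mathbb{1}_{K'}, \hat s_{K'} \mathbb{1}_K\right),$$

and for all $m' \in \mathcal{M}_\ell$,

$$\gamma_K(m') = \Big( \sum_{K' \in m'} F_K(K') \Big) - \mathrm{pen}(m' \vee K).$$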

We shall find for each cube $K \in \cup_{m \in \mathcal{M}_\ell} m$, a partition $m'_K \in \mathcal{M}_\ell$ such that

$$\gamma_K(m'_K) = \sup_{m' \in \mathcal{M}_\ell} \gamma_K(m'). \tag{9}$$

We shall compute then

$$\min_{m \in \mathcal{M}_\ell} \gamma(m) = \min_{m \in \mathcal{M}_\ell} \Big\{ \sum_{K \in m} \gamma_K(m'_K) + 2\,\mathrm{pen}(m) \Big\}. \tag{10}$$

We shall find $m'_K$ by using a slight adaptation of the procedure of Blanchard et al. (2004). Computing (10) is similar. The algorithm we propose is based on the one-to-one correspondence between $\mathcal{M}_\ell$ and the set $\mathcal{T}_\ell$ of $4^d$-ary trees with depth smaller than $\ell$.

Lemma 3. There exists $\psi_\ell$ a one-to-one map between $\mathcal{M}_\ell$ and $\mathcal{T}_\ell$ such that for all $m \in \mathcal{M}_\ell$, $\psi_\ell(m)$ is a tree whose leaves correspond to the elements of the partition $m$.

The construction of this map may for instance be deduced from Section 3.2.4 of Baraud and Birgé (2009).

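We need to introduce some notations. For each tree $T \in \mathcal{T}_\ell$ and bin $K''$ of $T$, we denote by $T(K'')$ the subtree of $T$ rooted in $K''$. The set of leaves of $T(K'')$ is denoted by $\mathcal{L}(T(K''))$. We set $R(K'')$ the tree reduced to its root $K''$ (i.e., $\mathcal{L}(R(K'')) = \{K''\}$). For all cube $K \in \cup_{m \in \mathcal{M}_\ell} m$, we set

$$\mathcal{L}(T(K'')) \vee K = \{K' \cap K,\ K' \in \mathcal{L}(T(K'')),\ K' \cap K \neq \emptyset\}$$

and we define the function $\mathcal{E}$ by

$$\mathcal{E}(T(K'')) = -L \frac{\log n}{n} \left|\mathcal{L}(T(K'')) \vee K\right| + \sum_{K' \in \mathcal{L}(T(K''))} F_K(K').$$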

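The key point is that computing (9) amounts to finding $T^*$ such that

$$\mathcal{E}\left(T^*([0,1]^{2d})\right) = \sup_{T \in \mathcal{T}_\ell} \mathcal{E}\left(T([0,1]^{2d})\right)$$

since $m'_K = \psi_\ell^{-1}(T^*)$.

We now take advantage of the additivity of the function $\mathcal{E}$: if $T(K'')$ is not reduced to its root, and if $K''_1, \dots, K''_{4^d}$ are the cubes of $\cup_{m \in \mathcal{M}_\ell} m$ such that $K'' = \cup_{i=1}^{4^d} K''_i$, then,

$$\mathcal{E}(T(K'')) = \sum_{i=1}^{4^d} \mathcal{E}(T(K''_i)). \tag{11}$$

For all cube $K'' \in \cup_{m \in \mathcal{M}_\ell} m$, let $T^*(K'')$ be a tree (rooted in $K''$) such that

$$\mathcal{E}(T^*(K'')) = \sup_{T \in \mathcal{T}_\ell,\ T \ni K''} \mathcal{E}(T(K'')).$$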
Remark that if $K'' \cap K = \emptyset$, this supremum is equal to 0, in which case $T^*(K'')$ will always stand for $R(K'')$. In general, we deduce from (11) that

$$\mathcal{E}(T^*(K'')) = \max\Big\{ \mathcal{E}(R(K'')),\ \sum_{i=1}^{4^d} \mathcal{E}(T^*(K''_i)) \Big\}. \tag{12}$$

Calculating (9) can thus be completed in that way: we start with the sets $K'' \in \cup_{m \in \mathcal{M}_\ell \setminus \mathcal{M}_{\ell-1}} m$ with $K'' \cap K \neq \emptyset$ for which the optimal local trees are reduced to their roots. By using relation (12) we find the optimal local trees $T^*(K'')$ when $K'' \in \cup_{m \in \mathcal{M}_{\ell-1} \setminus \mathcal{M}_{\ell-2}} m$, $K'' \cap K \neq \emptyset$. Proceeding recursively like this yields the optimal tree $T^* = T^*([0,1]^{2d})$.
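The recursion (12) is a bottom-up dynamic programme on a tree, which can be sketched generically as follows. The code is a simplified illustration under stated assumptions: the score of a leaf is supplied by a hypothetical callable `leaf_score` (playing the role of the value of $\mathcal{E}$ on a tree reduced to its root), `split` returns the children of a cell, and the toy example uses binary splits of an interval instead of the $4^d$-ary splits of the paper.

```python
def best_subtree(cell, depth, max_depth, leaf_score, split):
    """Return (value, leaves) of a subtree rooted at `cell` maximising an additive score.

    Relation (12): the optimal value is the maximum between the score of the tree
    reduced to its root and the sum of the optimal values of the children.
    """
    root_value = leaf_score(cell)
    if depth >= max_depth:
        # Cells of maximal depth cannot be split: the optimal tree is the root.
        return root_value, [cell]
    results = [
        best_subtree(child, depth + 1, max_depth, leaf_score, split)
        for child in split(cell)
    ]
    split_value = sum(value for value, _ in results)
    if split_value > root_value:
        leaves = [leaf for _, sub in results for leaf in sub]
        return split_value, leaves
    return root_value, [cell]


def halve(cell):
    """Toy split: cut an interval (lo, hi) into two halves."""
    lo, hi = cell
    mid = (lo + hi) / 2
    return [(lo, mid), (mid, hi)]


def toy_score(cell):
    """Toy score: each leaf costs 1, plus a large misfit if 0.5 is strictly inside the cell."""
    lo, hi = cell
    misfit = 100.0 if lo < 0.5 < hi else 0.0
    return -1.0 - misfit


value, leaves = best_subtree((0.0, 1.0), 0, 3, toy_score, halve)
```

Each cell is visited once, so the cost is proportional to the number of cells of the full tree times the cost of one evaluation of `leaf_score`; in the toy example the optimal partition splits once at 0.5 and stops.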



6. Proofs 

6.1. Proof of Proposition 1. Let us introduce the piecewise constant function 



(13) 



By using the triangle inequality we can decompose the risk of $\hat s_m$ as follows:



The first term can be bounded from above by $(4 + \log 2)\,\mathbb{E}[H^2(s\mathbb{1}_A, V_m)]$ thanks to Lemma 2 of Baraud and Birgé (2009). For the second term, we begin to define for $K \in m$ the random variable



Bk 



A 



n-l 



li4:(Xi, Xj+i) — 



i=Q 



\ 



n-l 



+i) I Xi] 



i=0 



Since $2nH^2(\bar s_m, \hat s_m) = \sum_{K \in m} B_K$, we shall bound from above the terms $\mathbb{E}[B_K]$. For this purpose, we introduce the stopping time

$$\tau = \inf\left\{ i \geq 0,\ \mathbb{E}\left[\mathbb{1}_K(X_i, X_{i+1}) \mid X_i\right] \geq \frac{1}{2n} \right\} \wedge (n-1)$$

with respect to the filtration $\mathcal{F}_n = \sigma(X_0, \dots, X_n)$ generated by the random variables $X_0, \dots, X_n$. We set $\varepsilon = 1 + \log 2 + 2\log n$ and we decompose $\mathbb{E}[B_K]$ as follows



^[Bk] < {l + e)E 



T-l 



\ i=0 



T-l 



. ^E[lK{Xi,Xi+i)\Xi 

\ i=0 



T-l 



(14) 



Yet, 



+ {l + e-^)E 
< 2(l + e)E 

+ {l + £-^)E 



A 



n-l 



i=T 



\ 



n-l 



i=T 



Er=T {tK{Xi,Xi+i) -E[lK{Xi,Xi+i) I Xi]))^ 



E 



T-l 

^E[1k(X,,X,+i) 
L j=o 



<l/2, 



and we control the second term of the right-hand side of inequality (14), by using the claims 
below. 

Claim 1. For all $K \in m$, $j \in \{0, \dots, n\}$, and $A' \in \mathcal{F}_j = \sigma(X_0, \dots, X_j)$,



E 



Er=7 (lx(Xi,Xi+i) - E [lK{Xi,Xi+i) I X,]))' 

E7=j''^[^KiXi,Xi+,)\Xi] 



1^' 



n— 1 

<^E 

k=j 



E[lKiXk,Xk+i) I Xk] 

Zi=j^[tKiXi,Xi+i)\x, 



Proof of Claim 1. Let us define the random variables 



n-l 



Yn-l{K) =Y,{^K{Xi,Xi+i) -E[lK{X^,Xi+i) \ Xi]) midZn{K) 



l=J 



YriZ}n^K{Xi,Xi+,)\Xi^ 



We have 



E[Zn+l{K)\Tn] = 



E ([Yn-l{K) + {lK{Xn,Xn+l) " E [1k{X 

n, -^n+l) I ^n])]^ | J' 

Ztj^[^K{Xi,Xi+,)\Xi] 

Y^_^{K) + vav{lK{Xn,Xn+i) I Xn) 



Er=,-E[ii,(Xi,x,+i) 

< 7 (K\^ E[l;^(X„,X„+i)|X„] 



Er=,-E[iK(x,,x,+i)|x,]- 
Thus, since $A'$ is also $\mathcal{F}_n$-measurable,

The result ensues from induction.



A A' 



□ 



Claim 2. For all sequence $(u_n)_{n \geq 0}$ in $[0,1]$, and $j \geq 0$ such that $u_j \neq 0$,

$$\sum_{k=j}^{n-1} \frac{u_k}{\sum_{i=j}^{k} u_i} \leq 1 + \log n - \log u_j.$$



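Proof of Claim 2. Let $f$ be any non-negative continuous function such that $u_k = \int_k^{k+1} f(t)\,dt$ whatever $k$. Let $F$ be the primitive of $f$ such that $F(j) = 0$. Then,

$$\begin{aligned} \sum_{k=j}^{n-1} \frac{u_k}{\sum_{i=j}^{k} u_i} &= 1 + \sum_{k=j+1}^{n-1} \int_k^{k+1} \frac{f(t)}{F(k+1)}\,dt \\ &\leq 1 + \sum_{k=j+1}^{n-1} \int_k^{k+1} \frac{f(t)}{F(t)}\,dt \\ &\leq 1 + \log F(n) - \log F(j+1) \\ &\leq 1 + \log\Big( \sum_{k=j}^{n-1} u_k \Big) - \log u_j. \end{aligned}$$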



□ 
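The inequality of Claim 2 is elementary and can be checked numerically. The sketch below evaluates both sides for a random sequence in $[0,1]$ and every admissible starting index $j$; the helper names are illustrative and the check is not a substitute for the proof.

```python
import math
import random


def claim2_lhs(u, j):
    """Sum over k = j, ..., n-1 of u_k / (u_j + ... + u_k), with n = len(u)."""
    partial = 0.0
    total = 0.0
    for k in range(j, len(u)):
        partial += u[k]
        total += u[k] / partial
    return total


def claim2_rhs(u, j):
    """Right-hand side 1 + log n - log u_j, with n = len(u)."""
    return 1 + math.log(len(u)) - math.log(u[j])


# Random sequence with values in [0.01, 1], so that every u_j is non-zero.
rng = random.Random(0)
u = [rng.uniform(0.01, 1.0) for _ in range(200)]
checks = [claim2_lhs(u, j) <= claim2_rhs(u, j) for j in range(len(u))]
```

For the constant sequence $u_k = 1$ the left-hand side is a harmonic sum, which shows that the logarithmic growth in $n$ cannot be avoided.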



E 



By using Claim 1 with A' = [T = j], 



Y:il^¥.[lK{Xi,Xi+i)\Xi 



+E 



n—2 I n— 1 



3=0 



{lK{Xn-l,Xn) - ^[lK{Xn-l,Xn) \ Xn-l]f 
E[l^(X„_i,X„) 



lT=n-l 



Now, 



E 



lT=n- 



=E 



var (1a'(X„_i,X„) 1 



E[l;^(X„_i,X„) 

< P(r = n-1). 
We then use Claim 2 with $u_k = \mathbb{E}[\mathbb{1}_K(X_k, X_{k+1}) \mid X_k]$ to derive

fe' r^l'f '^'^ii ^^-^l ^ X:^[a + log2 + 21ogn)l..,] 



lT=n- 



,=0 Vfc=J-^^=^-^[^^^^^'^^+i)l^^] 



j=0 



< (l + log2 + 21ogn)P(r7^n-l). 
Finally, $\mathbb{E}[B_K] \leq 4 + 2\log 2 + 4\log n$ and hence

'- n 
which concludes the proof. □ 

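6.2. Proof of Theorem 2. When $\ell \leq n$, the result ensues from the following theorem whose proof is delayed to Section 6.3. In the theorem below, the constant $L_0 = 90$ can easily be improved but it seems to be difficult to obtain the value $L_0 = 0.03$ used in practice.

Theorem 7. For all $L \geq 90$ and $1 \leq \ell \leq n$, the estimator $\hat s = \hat s(L,\ell)$ satisfies

$$\forall \xi > 0, \quad \mathbb{P}\left[ C H^2(s\mathbb{1}_A, \hat s) \geq \inf_{m \in \mathcal{M}_\ell} \left( H^2(s\mathbb{1}_A, \hat s_m) + \mathrm{pen}(m) \right) + \xi \right] \leq 3e^{-n\xi},$$

where $C$ is a universal positive constant.

By integrating the inequality above, there exists $C' > 0$ such that

$$C'\,\mathbb{E}\left[H^2(s\mathbb{1}_A, \hat s)\right] \leq \inf_{m \in \mathcal{M}_\ell} \left\{ \mathbb{E}\left[H^2(s\mathbb{1}_A, \hat s_m)\right] + \mathrm{pen}(m) \right\}$$

and the conclusion follows from Proposition 1.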

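When $\ell$ is larger than $n$, we use the lemma below whose proof is postponed to Section 6.4.

Lemma 4. For all $L \geq 15$ and $\ell \geq n+1$, $\hat s(L,\ell) = \hat s(L,n)$ and $\hat s(L,\infty) = \hat s(L,n)$.

Consequently, if $\ell \geq n+1$ or $\ell = \infty$,

$$C'\,\mathbb{E}\left[H^2(s\mathbb{1}_A, \hat s)\right] \leq \inf_{m \in \mathcal{M}_n} \left\{ \mathbb{E}\left[H^2(s\mathbb{1}_A, V_m)\right] + \mathrm{pen}(m) \right\}.$$

Let $m^* \in \mathcal{M}_\ell$ such that

$$2 \inf_{m \in \mathcal{M}_\ell} \left\{ \mathbb{E}\left[H^2(s\mathbb{1}_A, V_m)\right] + \mathrm{pen}(m) \right\} \geq \mathbb{E}\left[H^2(s\mathbb{1}_A, V_{m^*})\right] + \mathrm{pen}(m^*).$$

Since

$$\inf_{m \in \mathcal{M}_\ell} \left\{ \mathbb{E}\left[H^2(s\mathbb{1}_A, V_m)\right] + \mathrm{pen}(m) \right\} \leq \frac{1}{2} + L\frac{\log n}{n},$$

we deduce $L|m^*|\log(n)/n \leq 1 + 2L\log(n)/n$ and thus $|m^*| \leq 2 + n/(L\log n) < n$. Remark now that the cardinal of a partition $m \in \mathcal{M}_\ell \setminus \mathcal{M}_n$ can be lower bounded by

$$|m| \geq 4^d + (4^d - 1)n \geq n + 1.$$

Consequently, $m^* \in \mathcal{M}_n$ and hence,

$$\inf_{m \in \mathcal{M}_n} \left\{ \mathbb{E}\left[H^2(s\mathbb{1}_A, V_m)\right] + \mathrm{pen}(m) \right\} \leq 2 \inf_{m \in \mathcal{M}_\ell} \left\{ \mathbb{E}\left[H^2(s\mathbb{1}_A, V_m)\right] + \mathrm{pen}(m) \right\}$$

which completes the proof. □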
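6.3. Proof of Theorem 7. The proof of this theorem requires the two following lemmas whose proofs are postponed to Sections 6.3.1 and 6.3.2.

Lemma 5. For all $m \in \mathcal{M}_\ell$, there exists a deterministic set $S_m$ containing $\hat s_m$ such that

$$\gamma_1(m) = \sup_{m' \in \mathcal{M}_\ell} \left\{ \alpha H^2(\hat s_m, \hat s_{m'}) + T(\hat s_m, \hat s_{m'}) - \mathrm{pen}(m') \right\} + \mathrm{pen}(m)$$

and

$$\gamma_2(m) = \sup_{m' \in \mathcal{M}_\ell,\ f' \in S_{m'}} \left\{ \alpha H^2(\hat s_m, f') + T(\hat s_m, f') - \mathrm{pen}(m') \right\} + 2\,\mathrm{pen}(m)$$

satisfy

$$\gamma_1(m) \leq \gamma(m) \leq \gamma_2(m).$$

Lemma 6. Set $\varepsilon = (2 + 3\sqrt{2})/8$. Under assumptions of Theorem 7, for all $\xi > 0$, there exists an event $\Omega_\xi$ such that $\mathbb{P}(\Omega_\xi) \geq 1 - 3e^{-n\xi}$ and on which, for all partition $m \in \mathcal{M}_\ell$,

$$\sup_{m' \in \mathcal{M}_\ell,\ f' \in S_{m'}} \left\{ (1-\varepsilon) H^2(s\mathbb{1}_A, f') + T(\hat s_m, f') - \mathrm{pen}(m') \right\} \leq (1+\varepsilon) H^2(s\mathbb{1}_A, \hat s_m) + \mathrm{pen}(m) + 22\xi \tag{15}$$

where $S_{m'}$ is defined in Lemma 5.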

Proof of Theorem 7. On fi^, for all m G Me, 

sup {(1 - £) {si A, f) + T {sm, f) - pen(m')} <{l + s) {si a, Sm) + pen(m) + 22^. 
m'eMe 

If T {sm, Sm) + pen(m) - pen(m) > 0, 

aH^{slA,Sm) < (1 - e)-??^ (sIa, Sm) +7'(sm,Sm) - pen(m) + pen(m) 
< (1 + £) {si A, Sm) + 2pen(m) + 22^ 
since a < 1 — e and since Sm belongs to ^m'^Me'^m' ■ 
If T {sm, Sm) + pen(m) - pen(m) < 0, 

Sm, Sm) < {§ 

mj Sm) + T ( 

Sm, Sm) - pen(m) + pen(m) 

< sup { aH^ {sm, Sm') + T {sm, Sm') - pen(m' ) } + pen(m) 

< 7i{m). 
Consequently, by Lemma 5, 

7i(m) < j{m) + - < 72(m) + -, 
n n 

which implies that 

aH'^{sm,Sm)< sup {aiJ^ (s^, /') + T (s^, /') - pen(m')} + 2pen(m) + -. 
m'eMe 

f'^^m' 



26 MATHIEU SART 

With V = (1 -e)/Q - 1 > 0, 

aH'^{Sm,Sra) < {l + y-^) {Sm, sIa) 

+ sup {{l-s)H^{slA,f')+T{sm,f')-pen{m')} + 2pen{m) + - 

m'eMe 

< (1 + y-^) sIa) + [(1 + e) (sIa, Sm) + pen(m) + 22^] + 2pen(m) + 

< (2 + £ + v-^) {sm, si a) + 3pen(m) + 22^ + -. 

n 

This leads to 

aH^{slA,Sm) < 2aH^{slA,Sm) + 2aH^(sm,s^) 

< 2(2 + a + e + y-^) sIa) + 6pcn(m) + 44^ + -. 

Finally, we have proved that there exists $C > 0$, such that, with probability larger than $1 - 3e^{-n\xi}$, for all $m \in \mathcal{M}_\ell$,

$$C H^2(s\mathbb{1}_A, \hat s_{\hat m}) \leq H^2(\hat s_m, s\mathbb{1}_A) + \mathrm{pen}(m) + \xi.$$

This concludes the proof. □

6.3.1. Proof of Lemma 5. 

Claim 3. Let, for all $K \in \cup_{m \in \mathcal{M}_\ell} m$, $K_1, \dots, K_l$ be the cubes of $\cup_{m \in \mathcal{M}_\ell} m$ such that $K \subset K_i$ for all $i \in \{1, \dots, l\}$. For all $i \in \{1, \dots, l\}$, let $I_i$ and $J_i$ be the subsets of $[0,1]^d$ such that $K_i = I_i \times J_i$. Set



-Ik, ae{0,...,n},be{l,...,n} 



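with the convention $a/0 = 0$ whatever $a \in \{0, \dots, n\}$. Then $|S_K| \leq l\,n(n+1)$, and

$$\forall K' \in \cup_{m \in \mathcal{M}_\ell} m,\ K \subset K', \quad \frac{\sum_{i=0}^{n-1} \mathbb{1}_{K'}(X_i, X_{i+1})}{\sum_{i=0}^{n-1} \int_{\mathbb{X}} \mathbb{1}_{K'}(X_i, x)\, d\mu(x)}\,\mathbb{1}_K \in S_K. \tag{16}$$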



We then define 



I Kern J 



where 5;^ is given by the claim above and introduce the random set 

Sm = Stuk'^k, V-ftT G m, rriK G Me > 

I /fern J 

For all / G Sm, we denote by mxif) any partition of Mg such that 

f(x) = s for all x G K 
and consider the partition

$$m(f) = \bigcup_{K \in m} \left( m_K(f) \vee K \right).$$

By definition, 

7(m) = 2pen(m) + ^ sup aH^ (sm'^K, Sm' ^k) + T (sm^K, s^' ^k) - pen(m^ V K) 
and thus 

7(m) = sup \ \aH'^ {s^Ik, Hk) + T (l^l/f , /Ik) - pen(mif (/) V K)] I + 2pen(m) 
= sup jai/^ (^Sm, +T [s„,, - pen(m(/))| + 2pen(m). 

f^Sm 

Now, for all m,m' € Mi, the estimator Sm', belongs to Sm with 

m{sm') = {Kn K', {K, K')emx m' , K D K' ^ ^ , 

which leads to 

7(m) > sup {aH"^ {srn,Srn') +T {srn,Srn') -pe'a.{rn{srn'))} + 2pen{rn). 

Since m{sra') C mUm', \m{sm')\ < \m\ + \m'\ and 71 (m) < 7(m). 

Let us now prove the inequality 7 < 72- A function / G Sm is constant on each set of the 
partition m{f). For K' G m{f), there exist Kem, K" G mxif) with if' C if" such that 

By relation (16), flx' € S^' and thus / = J2K'&m{s) f^K' belongs to S^^^y Consequently, 
Sm C ^m'&Mi^m' and the conclusion follows. □ 

6.3.2. Proof of Lemma 6. We start with the claim below. 
Claim 4. Let ip be the function defined on [0, +00)^ by 

ip{x,y) = ^ ^^^ for all x,y £ [0, +00) 

with the convention 0/0 = 0. 

Let, for all f,f' G L^(X^,M), with support included in A, Z(f,f') be the random variable 
defined by 

Then, 

(17) (1 - i=) {si A, f) + T (/, /') < (1 + ^) {si A, f) + Z (/, /') 



and



1 ""^ r 

(18) -V / ij^f{Xi,y),f'iXi,y)) dtx{y)<3{H\slA,f)+H\slA,f)). 

Proof. These inequalities can be obtained by using the same arguments as those used in the 
proofs of Propositions 2 and 3 of Baraud (2010). □ 

We shall prove (15) by applying the following concentration inequality to the random variable 
^ (/,/')■ 

Claim 5. For all i < n — 1, let Fi he the a-field generated by the random variables Xj for 
j G {0, . . . , i}. Let /i, . . . , /n G L^(X^, M) such that there exists 6 G M with sup^^^^ \fi{x)\ ^ b 
for all i & {0, . . . ,n — 1}. Set 



i=0 



and 



n-l 



Vn = J2^[fhXi,Xi+^)\Ti]. 



Then, for all P > b and x > 



1=0 



Sr,> 



Vn 



_ " - 2(/3 - b) 



+ I3x 



Proof By setting a'^ = 2(/3 - b), 

logP [Sn > aVn + Px] < -x + logE [exp - a(3-'Vn)] 

< -x + logE [exp {(3-^Sn-i - af3-^Vn) E [exp {p-\Sn - Sn-l)) \ J^n-l]] 
By using Bernstein inequality (Proposition 2.9 of Massart (2003)), 

E [exp {r\S^ - 5„_i)) I Tn-i] < exp (^^W^) 

and thus 

logP [Sn > aVn + Px] <-x + logE [exp {^^Sn-i - ar^Vn-i)] . 
The result follows by induction. □ 

Proof of Lemma 6. Set $z = (1 - 1/\sqrt{2})/4$, $\beta = (3/z + \sqrt{2})/2$ and for all $\xi > 0$,



= < 



/')S!x5^, ^ {H'if, sIa) + H^W, sIa)) + pen(m) + pen(m') + (3^ 



< 1 



{m,m')eMj 



On ri^, for all m, m' G (/, /') G S'm x Sm', 

Z{f, f) < z {H\f, si a) + H\f', si a)) + pen(m) + pen(m') + 
and (15) derives from (17) (with $\varepsilon = 1/\sqrt{2} + z$).
It remains to prove that $\mathbb{P}(\Omega_\xi^c) \leq 3e^{-n\xi}$. We have

P {^D < 5] P [Z{f, f) > z [H\s1a, /) + H\s1a, /)] + pen(m) + pen(m') + /^C] . 

(/,/')65™xS„, 
{m,m')&Ml 

We apply the concentration inequality given by Claim 5 with fi = ip (/, f), Sn = nZ(f, /') and 
by using relation (18), 

n— 1 

Vn = ^E[fi{Xi,Xi+,)\J'i] <3n{H\slAj) + H\slA,f')). 

i=0 

We obtain for all x > 0, 

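Note that $z = 3/(\sqrt{2}(\beta\sqrt{2} - 1))$. By using the inequality above with

$$\beta \frac{x}{n} = \mathrm{pen}(m) + \mathrm{pen}(m') + \beta\xi$$

we deduce that

$$\mathbb{P}(\Omega_\xi^c) \leq \sum_{(m,m') \in \mathcal{M}_\ell^2}\ \sum_{(f,f') \in S_m \times S_{m'}} e^{-n\left(\beta^{-1}\mathrm{pen}(m) + \beta^{-1}\mathrm{pen}(m') + \xi\right)}.$$

Now, by Claim 3, since $\ell \leq n$, $\log|S_m| \leq 3|m|\log(n+1)$ and thus $\beta^{-1}\mathrm{pen}(m) \geq (|m| + \log|S_m|)/n$ for all $m \in \mathcal{M}_\ell$. Consequently,

$$\mathbb{P}(\Omega_\xi^c) \leq \sum_{(m,m') \in \mathcal{M}_\ell^2}\ \sum_{(f,f') \in S_m \times S_{m'}} e^{-\left(|m| + \log|S_m| + |m'| + \log|S_{m'}| + n\xi\right)}.$$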

2 

The conclusion follows from the inequality J2meMe fi '"*' ^ (^^^ Section 3.2.4 of Baraud and 
Birge (2009)). □ 

6.4. Proof of Lemma 4. The lemma follows from the two claims below. 
Claim 6. Let for each mi, m2 G Moo and K G mi, 

7i^(mi,m2) = aif^ {sm^lK, Smzli^) + T(sto^1x,Sto2 1k) - pen(m2 V K). 
Then, for all i i>n + 1, nii G Moo, K G nii, 

sup 7A-(mi, 777-2)= sup 7^(7771,7772) 

arid i/7MS 

sup 7ii:(777l, 7772) = SUp 7if (r77l, 7772). 



{m,m')eMj 



< 
Proof. Let $m_2^* \in \mathcal{M}_\ell$ such that $\gamma_K(m_1, m_2^*) = \sup_{m_2 \in \mathcal{M}_\ell} \gamma_K(m_1, m_2)$. In Section 2, we have defined the collection $\mathcal{M}_\ell$ of partitions of $[0,1]^{2d}$. Likewise, by using the algorithm of DeVore and Yu (1990), we define the collection $\mathcal{M}_\ell(K)$ of partitions of $K$. Note that $m_2^* \vee K$ belongs to $\mathcal{M}_\ell(K)$. Since $H^2(\hat s_{m_1}\mathbb{1}_K, \hat s_{m_2^*}\mathbb{1}_K) \leq 1$ and $|T(\hat s_{m_1}\mathbb{1}_K, \hat s_{m_2^*}\mathbb{1}_K)| \leq 2$, we have
^ *^ ^ Q |777^ Vi^llogn 

77 

Remark that 

7i^(mi,r77^) >7x(mi,{[0,lp'^}) > 

which leads to 

Imny K\ <1 + — < 77. 

L log 77 

This implies that m\\J K belongs to M.n{K). There exists 777* G M.n such that mly K = 7773 VK 
and hence 7/^(7771,7772) = 7i<:(?T7i, ^2) which concludes the proof. □ 

Claim 7. Set for all m G M.00 cind K e m, 

lK{m)= sup ^Kirn,m2). 

Then, 7(777) = 2pen(r77) + Y^Kem iKijn) and for all £ G N*, £ > 77 + 1, 

inf 7(777) = inf 7(777) 



and thus 



inf 7(777) = inf 7(777). 



Proof Let 777* G Aii such that inf^eA^^ 7(777) = 7(777*). By Lemma 5, 

7(777*) > sup {aH'^{sm,Sm') + T{sm,Sm')-pen{m')} + L 

m'eMe 

> ( -2 - + ^ |777*|log77 

~ \ n J n 

^ {\m*\ — 1) log 77 

> -2 + L- ! ' ^ . 



Now, 



which implies that 



77 



l{m*) < 7({[0, 1]'''}) < + 3 

77 



, J., 5r7 

\m*\ < 3 + — < 77 

L log 77 



and thus 777* G Mn- D 
6.5. Proof of Theorem 3. Consider the regular partition $m_{\mathrm{ref}}$ of $[0,1]^{2d}$ into cubes with side length $2^{-\ell}$, that is



m,r. 



,e/ = {i^^,i,l=(fc,...,fc), A;G{1,...,2^}} 
where K^^i is defined in Section 2.2. For all partition m G M.^, Vm C Vm^^f Set 

^eq = [V5l,52 e Vm^^f, h'^{9l,92) < ^2)] 

and define Sm an element of Vm such that h'^{slA, Sm) = h?{stA, Vm)- 
For all m G Me, 

< 2E [h^ (si A, Sm) lOeJ + 2E [h^ {Sm, Srh) lOeJ + E [/l^ (si A, Sm) lag 

< 2E [h^ {si A, Sm) iQeJ + 22E [i?^ (s^, s^) l^^^J + E [/t^ (si^, s^) i^^^ 

< 2E [h^ {si A, Sm) In J + 44E [H^ {si a, Sm) IfieJ + 44E [H^ {sm, si a) IqJ 



+E 



Now, /i2 (si^^ s^) = (si^^ Sm)] = h'^{slA, Vm) and 



E 



h'^ {slA,Sm) ln= 



< 2E 

< E 



{h\s,0) + h\sm,0)) Inc 



1 + 2 sup h'^{sm,0) h 



Let for all K e m, Ik and Jk be the subsets of [0, l]'^ such that K = Ik x Jk- Then, 



dx < \m\ 



Since m C m. 



ref, 



\m\ < \m, 



ref] 



4^^ and thus, 



C'E {si A, Sm)] < inf {/i' {si A, Vm) + pen(m)} + 4^'^P (J^^J 

for some universal constant C > 0. 

We now bound from above the term P (Og^). We denote by I^e/ the regular partition of [0, 1]^ 
into cubes with side length 2~^. Remark that 



i^lq) < 



11 



n—l 



31 e iref, P (Xi e /) > - V i/(x 



i=0 



< 2^'^ sup 



/ei. 



re/ 



1 in 

- ^ (1,(X,) - P (X, G /)) < - -P (Xi G /) 



j=0 



We use the following Bennett-type inequality for /3-mixing random variables (with / = — 1/, 
v = F{XieI),c = 0,^ = 10/llP {Xi G /)). 




Proposition 8. Let (Xj)j>i be a stationary sequence of random vectors with values in M'^, and 
let f be a real-valued function on M*^ bounded from above by c> such that v = IE [/(Xj)^] < oo. 

Then, for all q E {1, . . . ,n} and ^ > 0, 



We then have for all / G I^e/, 



which concludes the proof. □ 

Proof of Proposition 8. Let Z be the smallest integer larger than n/ (2g). We derive from Berbee's 
lemma and more precisely from Viennet (1997) (page 484) that there exist X*, . . . ,X2iq such 
that 

• For j = the random vectors 

= (-^20-1)9+1, • • • , X2(j-i)q+g) and X* ^ = {X2(^j_i-^g_^^, X2(^j_i^g_^_g) 
have the same distribution, and so have the random vectors 

Xj,2 = (X2(j_i)q+5+i, . . . , X2jq) and X|_2 = {X^j_;^-^g_^_g_^_i, X2jq). 

• The random vectors i, ■ • • , X^^ are independent. The random vectors X* 2) ■ • • > -^^2 
are also independent. 

• The event 

^^*= n ([x.-,i7^x^,i]n[x,,2 7^x*2]) 

l<j<l 

satisfies P [(^7*)^] < 2l(3q. 
We set gi{x) = f{x) if i < n and gi{x) = otherwise. For j G {1, . . . , Z}, we set 

9j,l{xi, ■■■,Xq) =^g2{j-l)q+i{Xi) and 5^,2(3^1! ■■■,Xq) = ^g2(j-l)q+q+i{Xi). 
i=l i=l 

Then, 



n \ 

-^{g,ix,) - E[g,{x^)]) > ^] nn* 



n . 

^ 1=1 



+P IE (5^,2(X|,2) - E [5^,2(X*2)]) > </2 



< 2 exp 



8g {nv + cn^/6) 
by using Proposition 2.8 of Massart (2003). □

6.6. Proof of Corollary 2. The corollary ensues from the claim below and Theorem 2 of Baraud and Birgé (2009).

Claim 8. Under Assumption 2, for all £ eN* such that 1^'^ > n, 

inf dl ViU, Vj, + ' ' ^ < 4 irif dl {^s\a, + ' ' ^ • 

Proof. For all partition m G TWoo and cube K e m, we denote by Ik and Jk the cubes of [0, 1]*^ 
such that K = Ik X Jk and set 

/A--5(a;,y)dxdy 

In this paper, c?2 stands for the standard cuclidean distance of L^([0, l]^'^, /ii^^). In this proof, we 
make a small abuse of notations by denoting by d^ the standard euclidean distance of L^(M^°', [i® 
„). 

Let m* be a partition of Moo such that 

^ ■ r f ,9 / /-^ \ Imllognl ^ ,o / \ |m*|logn 

Let C be the collection C = {K G m*, h{Ik) > 2"^'^} and let m* be a partition of Me containing C 
such that 

|m*| = inf{|m|, m G Me such that m 3 C}. 
Let A' be the set defined by A' = UK&m'K and V^. = / G Hn*} • 

We have, 

dl {y/^lA,Vm') < dl {V^1a',V;^,) +dl {V^lAn{A'r,0) 

and 

dl {^/slA^,V^,) < dl {^/slA^,^/^'i-A^) < dl {^/slA,^/^) ■ 

By using Lemma 2 of Baraud and Birge (2009), dl {-s/sIa, s/^m*) < 2d| (-y/sl^, I/^*) which 
shows that 

dl {^/^lA,Vm^) < 2dl {^/^lA,Vm*) + dl [V^l An{A'r , O) ■ 

Now, 

dl{VslAn{A'y,0) < y2 \ s{x,y)dy) dx 



Kem*\C ■ 



Kem*\C 

Since $|m'| \le |m^*|$, we have

\[ d_2^2(\sqrt{s} 1_A, V_{m'}) + \frac{|m'| \log n}{n} \le 2 d_2^2(\sqrt{s} 1_A, V_{m^*}) + \frac{(1 + \log n) |m^*|}{n}, \]

which proves the claim. □




6.7. Rates of convergence for h. We prove the result only for geometrically β-mixing chains (the proof for arithmetically β-mixing chains being similar). We use the claim below, whose proof is the same as that of Claim 8.

Claim 9. Under Assumption 2, for all £ G N* such that > nj log^ n, 

.2/ , , ^ , |m|logn\^^ . _ r,2, . , X , \m\\oin 



inf \h'{s\A.Vm)+ ' ' ^ inf <^ /i^ (sIa, Hn) + 

By using this claim and Theorem 1 of Akakpo (2012), 

2d_ flog^n\^ log^n Rn{^) 



(19) CE < l^ul^f C^) "V >2|^ + 

and by using Theorem 2 of Akakpo (2012), 

L J I \p,a y n J n n 

where $C > 0$ depends only on $\kappa, \sigma, d, p$ and where

If $\sigma > \sigma_1(p, d)$ then $\theta > 1/2$. There thus exists $n_0$ (depending only on $\theta$) such that if $n \ge n_0$, $2^{-2\ell\theta d} \le \log n / n$, and hence

^/^r,9/ . . XT I ^, 1^ /logn\ <^+'' logn -Rn(-^) 
L J I ip,a y ri J n n 

If n < no, we deduce from (19), 



n^\h-i( t - W <, \ r\ |a^/^log^y+^' log^ i2n(£) 



n 



< C" 



where C" depends only on a,d,p. The conclusion ensues from the fact that Rn{() is upper- 
bounded by a constant depending only on koj ^i- D 

6.8. Proof of Proposition 4. We shall use the following lemma, whose proof is similar to that of Lemma 6.

Lemma 7. Set $\varepsilon = (2 + 3\sqrt{2})/8$. Under the assumptions of Proposition 4, there exists a universal constant $L_0 > 0$ such that for all $L \ge L_0$ and $\xi > 0$,

V/, /' eS, (1 - e) {si A, f) + T (/, /') <{l + e) {si a, f) + lM/1±M/1 + 22^ 



with probability larger than 1 — e 




Proof of Proposition 4. By using the lemma above, with probability larger than $1 - e^{-n\xi}$, for all $f \in S$,

sup \{l-s) {si A, f) + T (/, /') - 1 < (1 + ^) ^2 ^ ^^sif) ^ 22^_ 

/'e5 I ) ^ 

Thus, if T{f, /) + - > 0, 

aH\slAj) < {l-e)H\slAj)+T{fJ)-L^^ + L^^ 

< {l + e)H^slA,f) + 2L^^ + 22^. 

UT{fJ) + L^-L^<0, 

aH\fJ) < aH\f,f) + T{f,f)-L^^ + L^^ 

n n 

< snJaH\f,f')+nf,f')-L^^\+L^ 
f'GS [ n ] n 

< pU) 

< Pif) + l 

< sup \aH^ if, f) + T if, f) - I + + 1 . 

With V = (1 -e)/a- 1 > 0, 
aH\fJ) < {l + v-')H^f,slA) 



+ sup I (1 - 8)H^ {si A, f) + T (/, /') - L 
f'es L 

< {l + v-^)H^{f,slA) + 



Asif')} ^ ^Asif) ^ 1 



n \ n n 



n ri 



< (2 + £ + v-^)H^{f,slA) + 2L^^ + 22i+-. 
This leads to, 

aH\slA,f) < 2aH''{slA,f) + 2aH\f,f) 

< 2 (2 + a + e + y-^) (/, sIa) + 4L^^ + 44^+-. 

Finally, we have proved that there exists $C > 0$ such that, with probability larger than $1 - e^{-n\xi}$, for all $f \in S$,

CH\s1a, /) < if, sIa) + + e 

The conclusion follows. □ 
6.9. Proof of Corollary 4. Throughout this proof, the distance associated with the supremum norm $\|\cdot\|_\infty$ is denoted by $d_\infty$. We shall use the following lemma (the first part may be deduced from the work of Akakpo (2012) whereas the second part may be deduced from results in Dahmen et al. (1980)).

Lemma 8. There exists a collection $\mathbb{W}$ of (finite dimensional) linear spaces such that for all
p G (0, +oo], P > {1/p - 1/2)+ and f € ^^(LP([0, 1])), L>0,t>0, a>0, 

CM^{L''dl''{g,W) + {dimW)T} < {L\g\;^^)^^ + r 

where C > depends only on p, 13. Moreover, for all(3>0, f £ ^^([0, 1]), L > 0, r > 0, a > 0, 

C ud^ {L'd^^ig, W) + (dim W)t} < (L|5|^,^) ^ ^^Iffi + ^ 
where C > depends only on fi. 
Let us define

\[ u(x,y) = \frac{y - g(x)}{1 + \|g\|_\infty} \quad \text{and} \quad \Phi(x) = \varphi\big((1 + \|g\|_\infty) x\big) \quad \text{for all } x, y \in [0,1]. \]

Let $\mathbb{W}$ be the family of linear spaces given by the lemma above. Define, for all $W \in \mathbb{W}$, the linear space

\[ T_W = \{(x,y) \mapsto a(y - f(x)), \ a \in \mathbb{R}, \ f \in W\} \]

and $\mathbb{T} = \{T_W, \ W \in \mathbb{W}\}$. Since $\Phi$ belongs to $\mathcal{H}^{\alpha}([0,1])$, we deduce from Corollary 1 of Baraud and Birgé (2011) and from our Theorem 5 that there exists an estimator $\hat{s}$ such that

logn 



n 



[H^ is, I)] < mf^ {|$|^,,^id^("^^)(«,T) + (dimr)r„} + ^nf^ |d^($, W) + (dim W^)- 
where C > depends on a, n and where 

log n 

Tn = (logn V log (|$|oo,aAl)) 

Now, 

mf {|$Poo,.AW?'''^(^,r) + (dimr)r„} < M^{\<P\l^,^,df'''''\g,W) + (dimW + l)r„} 
and the conclusion follows from the lemma above. □ 

6.10. Proof of Lemma 1. The first part of the lemma may be deduced from Proposition 4 of Baraud and Birgé (2011). For the second part, we shall build $\varphi' \in \mathcal{H}^{\alpha}(\mathbb{R})$ such that $\varphi'|_{[0,1]} \notin \bigcup_{b > \alpha} \mathcal{H}^{b}([0,1])$ and $g' \in \mathcal{H}^{\beta}([0,1])$ such that $g'(0) = 0$ and

^'og'G n'^^'^^HiO, 1]) \ U.^eiM'^Mo, 1]). 

By setting $\varphi = \varphi'$ and $g = -g'$, the function $f$ defined by

\[ f(x,y) = \varphi'(y - (-g'(x))) \quad \text{for all } x, y \in [0,1], \]

is suitable since $f(x,0) = \varphi' \circ g'(x)$ and $f(0,y) = \varphi'(y)$.

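If $\alpha, \beta \le 1$, we can choose $\varphi'(x) = x^{\alpha}$ on $[0,1]$ and $g'(x) = x^{\beta}$. If $\beta \ge \alpha \vee 1$, then choose $\varphi' \in \mathcal{H}^{\alpha}(\mathbb{R})$ such that $\varphi'|_{[0,1]} \notin \bigcup_{b > \alpha} \mathcal{H}^{b}([0,1])$ and $g'(x) = x$. If now $\alpha \ge \beta \vee 1$, we choose $\varphi' \in \mathcal{H}^{\alpha}(\mathbb{R})$ such that $\varphi'|_{[0,1]} \notin \bigcup_{b > \alpha} \mathcal{H}^{b}([0,1])$ and such that $\varphi'(x) = x$ for all $x \in [0, 1/2]$. We then consider $\zeta \in \mathcal{H}^{\beta}([0,1]) \setminus \bigcup_{b > \beta} \mathcal{H}^{b}([0,1])$ and $g'(x) = (\zeta(x) - \zeta(0)) / (2 \sup_{y \in [0,1]} |\zeta(y) - \zeta(0)|)$. □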

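6.11. Proof of Corollary 5. Throughout this proof, $d_\infty$ stands for the distance associated with the supremum norm $\|\cdot\|_\infty$. Let us define

\[ \forall x, y, z \in [0,1], \quad u(x,y) = (u_1(x,y), u_2(x,y), u_3(x,y)) = \left( \frac{y - v_1(x)}{1 + \|v_1\|_\infty}, \ \frac{v_2(x)}{\|v_2\|_\infty}, \ \frac{v_3(x)}{\|v_3\|_\infty} \right), \]

\[ \Phi(x,y,z) = \|v_3\|_\infty \, z \, \varphi\big((1 + \|v_1\|_\infty) \|v_2\|_\infty \, x y\big). \]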
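Let $\mathbb{W}$ be the family of linear spaces given by Lemma 8. Define, for all $W \in \mathbb{W}$, the linear spaces

\[ T_W = \{(x,y) \mapsto a(y - f(x)), \ a \in \mathbb{R}, \ f \in W\} \quad \text{and} \quad F_W = \{(x,y,z) \mapsto z f(xy), \ f \in W\} \]

and set $\mathbb{T}_1 = \{T_W, \ W \in \mathbb{W}\}$, $\mathbb{T}_2 = \mathbb{W}$, $\mathbb{T}_3 = \mathbb{W}$, $\mathbb{F} = \{F_W, \ W \in \mathbb{W}\}$.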
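The role of the normalising constants can be checked by a direct composition. The computation below is a sketch under an assumption: the model of Corollary 5 is stated outside this section, and we take it here to be $s(x,y) = v_3(x) \, \varphi\big(v_2(x)(y - v_1(x))\big)$, which is the form suggested by the spaces $T_W$ and $F_W$. With $u_1 = (y - v_1(x))/(1 + \|v_1\|_\infty)$, $u_2 = v_2(x)/\|v_2\|_\infty$, $u_3 = v_3(x)/\|v_3\|_\infty$ and $\Phi(x,y,z) = \|v_3\|_\infty z \varphi((1 + \|v_1\|_\infty)\|v_2\|_\infty xy)$, one gets

\[ \Phi(u_1, u_2, u_3) = \|v_3\|_\infty \cdot \frac{v_3(x)}{\|v_3\|_\infty} \cdot \varphi\left( (1 + \|v_1\|_\infty) \|v_2\|_\infty \cdot \frac{y - v_1(x)}{1 + \|v_1\|_\infty} \cdot \frac{v_2(x)}{\|v_2\|_\infty} \right) = v_3(x) \, \varphi\big(v_2(x)(y - v_1(x))\big). \]

Under this assumption $s = \Phi \circ u$, where the components $u_2$ and $u_3$ are bounded by 1 in absolute value and $|u_1| \le 1$ on $[0,1]^2$, so each component can be approximated separately in supremum norm.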

It ensues from Corollary 1 of Baraud and Birgé (2011) and our Theorem 5 that there exists an estimator $\hat{s}$ such that

CK [H' {s,s)] < mf^ {||.3||L(1 + ll-i||oo)^('^^^)||^2||^'^^^VPoo,.rf^^'^"'^(«i,?^) + (dimr)r«} 
+ mf J||.3||L(1 + lbi||oo)^('^"^)|b2||^-^^)|^lL,.rff "^^(^2,T) + (dimT)r(2)} 
+ M^{\\v3\\lMldl{us,T) + (dimT)r^3)| 

logn 



+ inf {di^{^,F) + [dim F) 
Few 1 ' ^ ^ ^ n 



where 



logn 



n 
logn 

n 



r« = (logn V log (||.3||L(l + ll-i||oo)^('^"^VlL,.ll-2||^'^"^) 

r(^) = (logn V log (||^3||L(l + lbi||oo)^('^"^VlL,.ll-2fi'^"^))) 

r(=^) = (logn V log (||.3||Lll^liy)^- 

Hence, 



+ M^^{M14^''"'\vs,W) + (dimT^)r(3)} 



+ inf U\v3\\l,dl,{^,W) + {dimW}^^ 
Calculating these minimums via Lemma 8 leads to the result. □ 
6.12. Proof of Lemma 2. The first part of the lemma can be deduced from Proposition 4 of Baraud and Birgé (2011). For the second part, remark that, as in the proof of Lemma 1, the
problem amounts to finding (f)' G H^(M) with (/)'|[o,i] ^ Ua>(,H"(M), v\ G ^^'([0, 1]) for i G {1,2}, 
v[{Qi) = 0, v^(0) = 1 such that 

'^.cp'ivW,) G n'^^'M^io, 1]) \ U n'{[Q, 1]). 

6>6i(/3i,/32,o-) 

If 6{(3i, /32,(t) = 2~^(/32 A 1), choose V2{x) = (1 — x)^^^"^ and take cj)' as being any function 
of WiM) such that 0'|[o,i] ^ U„>^^"(M) and such that (?!)'(0) = 1. If e{l3i,P2,cr) = cr, choose 
v[{x) = 2(\/l + X— 1), V2{x) = l/2(-v/ir+~a; + l) and take </>' as being any function of 'H°'(R) such 
that 0'|[o,i] Ua>o-'H"(M). If ^(/3i, /32, a) = o"/3i, we may assume that cr < 1 and /3i < 1. We can 
then choose v[{x) = x^^, V2{x) = 1 and (p'{x) = x" for x G [0, 1]. If 9{/3i, /32,cr) = (^^2, we may 
assume that a < 1 and /92 < 1 and choose ^^(a;) = 1 for x G [1/2, 1], V2{x) = 1 — (1 — x)^'^ for 
x G [1/2, 1] and 4>'{x) = (1 — x)'^ for x G [0, 1]. Finally, if 9{f3i, l32, cr) = f3i, we may assume that 
^1 < 1. We can then choose v[{x) = x^^, V2{x) = (1 — x)^^^'^ and such that ^'(x) = x for 
xG [0,1/2]. □ 

6.13. Proof of Proposition 6. We proceed in 3 steps. 

Step 1. We associate with each cube $K \in \bigcup_{m \in \mathcal{M}_\ell} m$ a place in the computer's memory. Then, for each $i \in \{1, \dots, n\}$ we determine the sets $K \in \bigcup_{m \in \mathcal{M}_\ell} m$ such that $1_K(X_i, X_{i+1}) > 0$. There are at most $\ell$ such sets. This allows us to store all the $\sum_{i=0}^{n-1} 1_K(X_i, X_{i+1})$ in around $O(n \ell d)$ operations. For all $K \in \bigcup_{m \in \mathcal{M}_\ell} m$, let $I_K$ and $J_K$ be the subsets of $[0,1]^d$ such that $K = I_K \times J_K$. We can store all the $\mu(J_K)$ in $O(4^{\ell d})$ operations and all the $\sum_{i=0}^{n-1} 1_{I_K}(X_i)$ in $O(n \ell d)$ operations. This allows us to store quickly

\[ \sum_{i=0}^{n-1} 1_K(X_i, X_{i+1}) \quad \text{and} \quad \sum_{i=0}^{n-1} \int 1_K(X_i, x) \, d\mu(x) \]

for all $K \in \bigcup_{m \in \mathcal{M}_\ell} m$. These values have to be calculated to know the $F_K(K')$ and thus to use the algorithm presented in Section 5.
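Step 2. For each $K \in \bigcup_{m \in \mathcal{M}_\ell} m$, we use the algorithm of Section 5 to design $m'_K$. Let us denote by $j \in \{0, \dots, \ell\}$ the smallest integer such that $K \in \mathcal{K}_j$ where $\mathcal{K}_j$ is defined in Section 2.2.

• To find $m'_K$, we begin by computing $\mathcal{E}(T^*(K''))$ for all $K'' \in \bigcup_{m \in \mathcal{M}_\ell \setminus \mathcal{M}_{\ell-1}} m$ such that $K'' \cap K \neq \emptyset$. The complexity of this is around the number of such sets, i.e., $4^{(\ell-j)d}$.

• Next, thanks to relation (12) we compute $\mathcal{E}(T^*(K''))$ for all $K'' \in \bigcup_{m \in \mathcal{M}_{\ell-1} \setminus \mathcal{M}_{\ell-2}} m$ such that $K'' \cap K \neq \emptyset$. There are $4^{(\ell-j-1)d}$ such sets. The complexity of this operation is thus $4^d \times 4^{(\ell-j-1)d}$.

• By recurrence, we compute $\mathcal{E}(T^*(K''))$ for all $K'' \in \bigcup_{m \in \mathcal{M}_\ell \setminus \mathcal{M}_j} m$ such that $K'' \cap K \neq \emptyset$ in at most

\[ 4^{(\ell-j)d} + \sum_{k=1}^{\ell-j-1} 4^d \times 4^{(\ell-j-k)d} \le 3 \times 4^{(\ell-j)d} \]

operations.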
• We then get $\mathcal{E}(T^*([0,1]^{2d}))$ in additional operations.
We apply this algorithm for all K G U^eX^^T- When K G ICj, computing requires 
thus O (4:^^~^^^ + A'^'j'j operations. Since \ICj\ = 4^*^, computing all the m'^ requires 
finally 

e 

3=0 

operations. 

Step 3. Now, by slightly modifying the algorithm, we can compute (10) in O (4^^+^)'^) operations. 

□ 
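Step 1 and the operation count of Step 2 can be sketched in a few lines. This is a minimal illustration under assumptions, not the implementation used for the simulations: it takes the cubes to be the dyadic cubes of $[0,1]^{2d}$ of levels $0, \dots, \ell$, assumes every observation lies in $[0,1]^d$, and reads the Step 2 count as $4^{(\ell-j)d} + \sum_{k=1}^{\ell-j-1} 4^d \times 4^{(\ell-j-k)d}$. The names `dyadic_index`, `transition_counts` and `step2_cost` are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def dyadic_index(point: Sequence[float], level: int) -> Tuple[int, ...]:
    """Integer coordinates of the dyadic cube of side 2**-level containing `point`.

    Coordinates equal to 1.0 are mapped to the last cell.
    """
    cells = 2 ** level
    return tuple(min(int(c * cells), cells - 1) for c in point)


def transition_counts(chain: Sequence[Sequence[float]], max_level: int) -> Counter:
    """Store sum_i 1_K(X_i, X_{i+1}) for every dyadic cube K hit by some pair.

    Each pair touches one cube per level, so the loop costs about
    n * (max_level + 1) * 2d elementary operations.
    """
    counts: Counter = Counter()
    for x, y in zip(chain, chain[1:]):
        pair = tuple(x) + tuple(y)  # a point of [0,1]^{2d}
        for level in range(max_level + 1):
            counts[(level, dyadic_index(pair, level))] += 1
    return counts


def step2_cost(ell: int, j: int, d: int) -> int:
    """Operation count of the bottom-up pass inside a cube of level j (assumed form)."""
    first = 4 ** ((ell - j) * d)
    rest = sum(4 ** d * 4 ** ((ell - j - k) * d) for k in range(1, ell - j))
    return first + rest
```

Only cubes containing at least one pair $(X_i, X_{i+1})$ are stored, so the memory used by the counts never exceeds $n(\ell + 1)$ entries, while the geometric sum in `step2_cost` stays below $3 \times 4^{(\ell-j)d}$ for the values tried in small cases.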

Acknowledgements: many thanks to Yannick Baraud for his suggestions, comments, and careful reading of the paper. We are thankful to Claire Lacour for sending us the source code of the procedure of Akakpo and Lacour (2011).